05-Querying-JSON(Python)

Querying JSON & Hierarchical Data with DataFrames

Apache Spark™ and Azure Databricks® make it easy to work with hierarchical data, such as nested JSON records.

Getting Started

Run the following cell to configure our "classroom."

%run "./Includes/Classroom-Setup"

Examining the Contents of a JSON file

JSON is a common file format used in big data applications and in data lakes (or large stores of diverse data). File formats such as JSON arise out of a number of data needs. For instance, what if:

  • Your schema, or the structure of your data, changes over time?
  • You need nested fields like an array with many values or an array of arrays?
  • You don't know how you're going to use your data yet, so you don't want to spend time creating relational tables?

The popularity of JSON is largely due to the fact that JSON allows for nested, flexible schemas.

This lesson uses the DatabricksBlog table, which is backed by the JSON file dbfs:/mnt/training/databricks-blog.json. If you examine the raw file, you'll notice it contains compact JSON data: there's a single JSON object on each line of the file, and each object corresponds to a row in the table. Each row represents a blog post on the Databricks blog, and the table contains all blog posts through August 9, 2017.

%fs head dbfs:/mnt/training/databricks-blog.json
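This one-object-per-line layout is often called "JSON Lines." As a plain-Python illustration of how each line parses independently into one record (the two sample lines below are hypothetical, shaped like the blog records):

```python
# Two compact JSON objects, one per line -- the same layout as the raw blog file.
# Sample values are illustrative only.
import json

raw = (
    '{"title": "Post A", "authors": ["Alice"], "dates": {"tz": "UTC"}}\n'
    '{"title": "Post B", "authors": ["Bob", "Carol"], "dates": {"tz": "UTC"}}\n'
)

# Each line parses independently; Spark maps each parsed object to one row.
records = [json.loads(line) for line in raw.splitlines()]
```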

Create a DataFrame using the syntax introduced in the previous lesson. For JSON sources, Spark infers the schema automatically, so the inferSchema and header options (which apply to CSV sources) are unnecessary here:

databricksBlogDF = spark.read.json("/mnt/training/databricks-blog.json")

Take a look at the schema by invoking the printSchema method.

databricksBlogDF.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- content: string (nullable = true)
 |-- creator: string (nullable = true)
 |-- dates: struct (nullable = true)
 |    |-- createdOn: string (nullable = true)
 |    |-- publishedOn: string (nullable = true)
 |    |-- tz: string (nullable = true)
 |-- description: string (nullable = true)
 |-- id: long (nullable = true)
 |-- link: string (nullable = true)
 |-- slug: string (nullable = true)
 |-- status: string (nullable = true)
 |-- title: string (nullable = true)

Run a query to view the contents of the DataFrame.

Notice:

  • The authors column is an array containing one or more author names.
  • The categories column is an array of one or more blog post category names.
  • The dates column contains nested fields createdOn, publishedOn and tz.
display(databricksBlogDF.select("authors","categories","dates","content"))

authors | categories | dates | content
["Tomer Shiran (VP of Product Management at MapR)"] | ["Company Blog","Partners"] | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at MapR, announcing our new partnership...
["Tathagata Das"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | We are happy to announce the availability of Apache Spark 0.9.1...
["Michael Armbrust","Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-03-27","publishedOn":"2014-03-27","tz":"UTC"} | Building a unified platform for big data analytics has long been the vision of Apache Spark...
(remaining rows truncated)
["Dean Wampler (Typesafe)"]["Company Blog","Partners"]{"createdOn":"2014-06-13","publishedOn":"2014-06-13","tz":"UTC"}<div class="post-meta">This post is guest authored by our friends at <a href="http://www.lightbend.com" target="_blank">Lightbend</a> after having their Lightbend Activator Apache Spark templates be "Certified on Apache Spark".</div> <hr /> <h2>Apache Spark and the Lightbend Reactive Platform: A Match Made in Heaven</h2> When I started working with Hadoop several years ago, it was frustrating to find that writing Hadoop jobs was hard to do. If your problem fits a query model, then <a title="Hive" href="http://hive.apache.org" target="_blank">Hive</a> provides a SQL-based scripting tool. For many common dataflow problems, <a href="http://pig.apache.org" target="_blank">Pig</a> provides useful abstractions, but it isn't a full-fledged, "Turing-complete" language. Otherwise, you had to use the low-level <a href="http://wiki.apache.org/hadoop/MapReduce" target="_blank">Hadoop MapReduce</a> API. Some third-party APIs exist that wrap the MapReduce API, such as <a href="http://cascading.org...
["Hari Kodakalla (EVP at Apervi Inc.)"]["Company Blog","Partners"]{"createdOn":"2014-06-23","publishedOn":"2014-06-23","tz":"UTC"}<div class="post-meta">This post is guest authored by our friends at <a href="http://www.apervi.com" target="_blank">Apervi</a> after having their Conflux Director™ application be "Certified on Apache Spark".</div> <hr /> <h2>Big Data on Steroids with Apache Spark</h2> As big data takes center stage in the new data explosion, Hadoop has emerged as one the leading technologies addressing the challenges in the space. As the data processing needs of enterprises are growing newer technologies like Apache Spark have emerged as significant options that consistently offer expanded capabilities for the big data space. As these enterprise needs are met, so is the increased appetite for faster processing, low latency requirements for high velocity data and an iterative demand for processing where leading technologies like Hadoop fall short of expectations or at times seem cumbersome to implement due to its inherent design. Delivering on this growing need of enterprises is where Spark plays a ...
["Bill Kehoe (Big Data Architect at Qlik)"]["Company Blog","Partners"]{"createdOn":"2014-06-24","publishedOn":"2014-06-24","tz":"UTC"}<div class="post-meta">This post is guest authored by our friends at <a href="http://www.qlik.com" target="_blank">Qlik</a> describing how Apache Spark enables the full power of QlikView, recently Certified on Apache Spark, and its Associative Experience feature over the entire HDFS data set.</div> <hr /> <h2>The Power of Qlik</h2> Qlik provides software and services that help make understanding data a natural part of how people make decisions. Our product, QlikView, is the leading Business Discovery platform that incorporates a unique, associative experience that empowers business users to follow their own path to formulate and answer questions that lead to better decisions. Traditional, query-based BI tools force users thru pre-defined navigation paths which limit the kinds of questions that can be answered and require costly and time consuming revisions to address evolving business needs. In contrast, when a user selects data items using QlikView, all the fields and charts are imm...
["Databricks Press Office"]["Announcements","Company Blog"]{"createdOn":"2014-06-26","publishedOn":"2014-06-26","tz":"UTC"}<em>Certified distributions maintain compatibility with open source Apache Spark distribution and thus support the growing ecosystem of Apache Spark applications</em> <hr /> <strong>BERKELEY, Calif. -- June 26, 2014 --</strong> Databricks, the company founded by the creators of Apache Spark, the next generation Big Data engine, today announced the <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">“Certified Spark Distribution” </a>program for vendors with a commercial Spark distribution. Certification indicates that the vendor’s Spark distribution is compatible with the open source Apache Spark distribution, enabling “Certified on Spark” applications - certified to work with Apache Spark - to run on the vendor’s Spark distribution out-of-the-box. “One of Databricks’ goals is to ensure users have a fantastic experience. Our belief is that having the community work together to maintain compatibility and therefore facilitate a vibrant app...
["Costin Leau (Engineer at Elasticsearch)"]["Company Blog","Partners"]{"createdOn":"2014-06-28","publishedOn":"2014-06-28","tz":"UTC"}<div class="post-meta">This post is guest authored by our friends at <a href="http://www.elasticsearch.com" target="_blank">Elasticsearch</a> announcing Elasticsearch is now "Certified on Apache Spark", the first step in a collaboration to provide tighter integration between Elasticsearch and Spark.</div> <hr /> <h2>Elasticsearch Now “Certified on Spark”</h2> Helping businesses get insights out of their data, fast, is core to the mission of Elasticsearch. Being able to live wherever a business stores their data is obviously critical to that mission, and Hadoop is one of the leaders in providing a way for businesses to store massive amounts of data at scale. Over the course of the past year, we have been working hard to bring the power of our real-time search and analytics engine to the Hadoop ecosystem. Our Hadoop connector, Elasticsearch for Apache Hadoop, is compatible with the top three Hadoop distributions – Cloudera, Hortonworks and MapR – and today has achieved another exciting...
["Jake Cornelius (SVP of Product Management at Pentaho)"]["Company Blog","Partners"]{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}[sidenote]This post is guest authored by our friends at <a href="http://www.pentaho.com" target="_blank">Pentaho</a> after having their data integration and analytics platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a>[/sidenote] <hr /> One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in <a href="http://www.pentaho.com/what-is-big-data" target="_blank">Big Data</a> to solve new challenges using the existing skill sets they have in their organizations today. Our Pentaho Labs prototyping and innovation efforts around natively integrating data engineering and analytics with Big Data platforms like <a href="http://www.pentaho.com/what-is-hadoop" target="_blank">Hadoop</a> and <a href="http://www.pentaho.com/storm" target="_blank">Storm</a> have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include <a href="http://www.pent...
["SriSatish Ambati (CEO of 0xData)"]["Company Blog","Partners"]{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}<div class="post-meta">This post is guest authored by our friends at <a href="http://www.0xdata.com" target="_blank">0xData</a> discussing the release of Sparkling Water - the integration of their H20 offering with the Apache Spark platform.</div> <hr /> <h3>H20 – The Killer-App on Apache Spark</h3> <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/Spark-+-H20.png" width="472" /> In-memory big data has come of age. The Apache Spark platform, with its elegant API, provides a unified platform for building data pipelines. H2O has focused on scalable machine learning as the API for big data applications. Spark + H2O combines the capabilities of H2O with the Spark platform – converging the aspirations of data science and developer communities. H2O is the Killer-Application for Spark. <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/H20-the-Killer-App.png" width="472" /> <h3>Backdrop<...
["Databricks Press Office"]["Announcements","Company Blog"]{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}<ul> <li>Databricks Cloud Allows Users to Get Value from Apache Spark without the Challenges Normally Associated with Big Data Infrastructure</li> <li>Ease-of-Use of Turnkey Solution Brings the Power of Spark to a Wider Audience and Fuels the Growth of the Spark Ecosystem</li> <li>Funding Led by NEA with Follow-on Investment from Andreessen Horowitz</li> </ul> <strong>Berkeley, Calif. (June 30, 2014)</strong>—Databricks, the company founded by the creators of Apache Spark—the powerful open-source processing engine that provides blazingly fast and sophisticated analytics—announced today the launch of <a title="Databricks Cloud" href="https://databricks.com/cloud">Databricks Cloud</a>, a cloud platform built around Apache Spark. In addition to this launch, the company is announcing the close of $33 million in series B funding led by New Enterprise Associates (NEA) with follow-on investment from Andreessen Horowitz. “Getting the full value out of their Big Data investments is still...
["Arsalan Tavakoli-Shiraji"]["Company Blog","Events"]{"createdOn":"2014-04-29","publishedOn":"2014-04-29","tz":"UTC"}At Databricks, we’ve been thrilled to see the rapid pace of adoption of Apache Spark, as it has been embraced by an increasing number of enterprise vendors and has grown to be the most active open source project in the Hadoop ecosystem. We also know that a critical piece of enabling enterprises to unlock its potential is a strong ecosystem of applications built on top of or integrated with Spark. We launched the <a href="http://www.databricks.com/certification/">“Certified on Apache Spark”</a> program to support these application developer efforts, and have been blown away at the diverse set of applications being built on top of Spark, and want this great work to be exposed to the broader community. In that light, this year’s Spark Summit will have an “Application Spotlight” segment that will highlight some of the best we’ve seen. Read on for details on how to apply and what selection entails. All applications eligible (even if not yet certified) for the Databricks “Certified on Spar...
["Arsalan Tavakoli-Shiraji"]["Company Blog","Partners"]{"createdOn":"2014-05-08","publishedOn":"2014-05-08","tz":"UTC"}<p>Today, Datastax and Databricks announced a partnership in which Apache Spark becomes an integral part of the Datastax offering, tightly integrated with Cassandra. We’re very excited to be embarking on this journey with Datastax for a multitude of reasons:</p> <h2 id="integrating-operational-systems-with-analytics">Integrating operational systems with analytics</h2> <p>One of the use cases that we’ve increasingly been asked about by Spark users is the ability to create a closed loop system: perform advanced analytics directly on operational data that is then fed back into the operational system to drive necessary adaptation. The tight integration of Cassandra and Spark will enable users to achieve this goal by leveraging Cassandra as the high-performance transactional database that powers online applications and Spark as a next generation processing engine that can deliver deeper insights, faster while seamlessly moving between the two.</p> <h2 id="spark-beyond-hadoop">Spark beyond...
["Databricks Press Office"]["Announcements","Company Blog"]{"createdOn":"2014-04-30","publishedOn":"2014-04-30","tz":"UTC"} <p><strong>VANCOUVER, BC. – April 30, 2014 –</strong> Simba Technologies Inc., the industry’s expert for Big Data connectivity, announced today that Databricks has licensed Simba’s ODBC Driver as its standards-based connectivity solution for Shark, the SQL front-end for Apache Spark, the next generation Big Data processing engine. Founded by the creators of Apache Spark and Shark, Databricks is developing cutting-edge systems to enable enterprises to discover deeper insights, faster.</p> <p>“We believe that Big Data is a tremendous opportunity that is still largely untapped, and we are working to revolutionize what organizations can do with it,” says Ion Stoica, Chief Executive Officer at Databricks, and Professor of Computer Science at UC Berkeley. “As part of this mission, we understand that BI tools will continue to be a key medium for consuming data and analytics and are excited to announce the availability of an enterprise-grade connectivity option for users of BI tools. ...
["Databricks Press Office"]["Announcements","Company Blog","Partners"]{"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"}<strong>SAN FRANCISCO — July 1, 2014</strong> — Databricks, the company founded by the creators of Apache Spark – the popular open-source processing engine - today announced a new partnership with <a href="http://www.sap.com" target="_blank">SAP (NYSE: SAP)</a> and to deliver a Databricks-certified Apache Spark distribution offering for the SAP HANA® platform. The full production-ready distribution offering, based on Apache Spark 1.0, is deployable in the cloud or on premise and available for immediate download from SAP at no cost at <a href="http://spr.ly/SAP_and_Spark" target="_blank">spr.ly/SAP_and_Spark</a>. The announcement was made at the Spark Summit 2014, being held June 30 – July 2 in San Francisco. The Databricks-certified distribution offering for SAP HANA contains the Spark processing engine that works with any Hadoop distribution out of the box, providing a more complete data store and processing layer for Hadoop. Certified by Databricks to be compatible with the Apache ...
["Arsalan Tavakoli-Shiraji"]["Company Blog","Partners"]{"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"}This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership announced between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not just because of what it means for Databricks as a company, but just as importantly because of what it means for Apache Spark and the Spark community. <h2>Access to the full corpus of data</h2> Fundamentally, every enterprise's big data vision is to convert data into value; a core ingredient in this quest is the availability of the data that needs to be mined for insights. Although the growth in volume of data sitting in HDFS has been incredible and continues to grow exponentially, much of this has been contextual data - e.g., social data, click-stream data, sensor data, logs, 3rd party data sources - and historical data. Real-time operational data - e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and S...
["Reynold Xin"]["Apache Spark","Engineering Blog"]{"createdOn":"2014-07-02","publishedOn":"2014-07-02","tz":"UTC"}With the introduction of Spark SQL and the new Hive on Apache Spark effort (<a href="https://issues.apache.org/jira/browse/HIVE-7292">HIVE-7292</a>), we get asked a lot about our position in these two projects and how they relate to Shark. At the <a href="http://spark-summit.org/2014">Spark Summit</a> today, we announced that we are ending development of Shark and will focus our resources towards Spark SQL, which will provide a superset of Shark’s features for existing Shark users to move forward. In particular, Spark SQL will provide both a seamless upgrade path from Shark 0.9 server and new features such as integration with general Spark programs. <img class="alignnone wp-image-818 size-large" src="https://databricks.com/wp-content/uploads/2014/07/sql-directions-1024x691.png" alt="Future of SQL on Spark" width="400" /> <h2>Shark</h2> When the Shark project started 3 years ago, Hive (on MapReduce) was the only choice for SQL on Hadoop. Hive compiled SQL into scalable MapReduce jobs a...
["Ion Stoica"]["Company Blog","Product"]{"createdOn":"2014-07-14","publishedOn":"2014-07-14","tz":"UTC"}Our vision at Databricks is to <strong>make big data easy</strong> so that we enable <strong>every</strong> organization to turn its data into value. At Spark Summit 2014, we were very excited to unveil <a href="https://databricks.com/cloud" target="_blank">Databricks</a>, our first product towards fulfilling this vision. In this post, I’ll briefly go over the challenges that data scientists and data engineers face today when working with big data, and then show how Databricks addresses these challenges. <h2>Today’s Big Data Challenges</h2> While the promise of big data to <a href="http://spark-summit.org/2014/talk/using-spark-to-generate-analytics-for-international-cable-tv-video-distribution" target="_blank">improve businesses</a>, <a href="http://spark-summit.org/2014/talk/david-patterson" target="_blank">save lives</a>, and <a href="http://spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience" target="_blank">advance science</a> is becoming more and more real, analyzi...
["Xiangrui Meng"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2014-07-16","publishedOn":"2014-07-16","tz":"UTC"}MLlib is an Apache Spark component focusing on machine learning. It became a standard component of Spark in version 0.8 (Sep 2013). The initial contribution was from Berkeley AMPLab. Since then, 50+ developers from the open source community have contributed to its codebase. With the release of Apache Spark 1.0, I’m glad to share some of the new features in MLlib. Among the most important ones are: <ul> <li>sparse data support</li> <li>regression and classification trees</li> <li>distributed matrices</li> <li>PCA and SVD</li> <li>L-BFGS optimization algorithm</li> <li>new user guide and code examples</li> </ul> This is the first in a series of blog posts about features and optimizations in MLlib. We will focus on one feature new in 1.0 — sparse data support. <h2>Large-scale ≈ Sparse</h2> When I was in graduate school, I wrote “large-scale sparse least squares” in a paper draft. My advisor crossed out the word “sparse” and left a comment: “Large-scale already implies sparsity...
["Matei Zaharia"]["Apache Spark","Engineering Blog"]{"createdOn":"2014-07-19","publishedOn":"2014-07-19","tz":"UTC"}<div class="post-meta">This post originally appeared in <a href="http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/" target="_blank">insideBIGDATA</a> and is reposted here with permission.</div> <hr /> With the second <a href="http://spark-summit.org/2014">Spark Summit</a> behind us, we wanted to take a look back at our journey since 2009 when Apache Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch Spark mature over the years, thanks in large part to the vibrant, open source community that latched onto it and busily began contributing to make Spark what it is today. The idea for Spark first emerged in the AMPLab (AMP stands for Algorithms, Machines, and People) at the University of California, Berkeley. With its significant industry funding and exposure, the AMPlab had a unique perspective on what is important and what issues exist among early adopte...
["Burak Yavuz","Xiangrui Meng","Reynold Xin"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"}Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python (<a href="http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html">Scala/Java APIs also available</a>).<!--more--> [python] from pyspark.mllib.recommendation import ALS # load training and test data into (user, product, rating) tuples def parseRating(line): fields = line.split() return (int(fields[0]), int(fields[1]), float(fields[2])) training = sc.textFile(&quot;...&quot;).map(parseRating).cache() test = sc.textFile(&quot;...&quot;).map(parseRating) # train a recommendation model model = ALS.train(tra...
["Li Pu","Reza Zadeh"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2014-07-22","publishedOn":"2014-07-22","tz":"UTC"}<div class="post-meta">Guest post by Li Pu from Twitter and Reza Zadeh from Databricks on their recent contribution to Apache Spark's machine learning library.</div> <hr /> The <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition (SVD)</a> is one of the cornerstones of linear algebra and has widespread application in many real-world modeling situations. Problems such as recommender systems, linear systems, least squares, and many others can be solved using the SVD. It is frequently used in statistics where it is related to principal component analysis (PCA) and to correspondence analysis, and in signal processing and pattern recognition. Another usage is latent semantic indexing in natural language processing. Decades ago, before the rise of distributed computing, computer scientists developed the single-core <a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK package</a> for computing the eigenvalue decomposition of a matrix. Since...
["Scott Walent"]["Company Blog","Events"]{"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"}From June 30 to July 2, 2014 we held the <a href="http://spark-summit.org/2014">second Spark Summit</a>, a conference focused on promoting the adoption and growth of <a href="http://spark.apache.org">Apache Spark</a>. This was an exciting year for the Spark community and we are proud to share some highlights. <ul> <li>1,164 participants from over 453 companies attended</li> <li>Spark Training sold out at 300 participants</li> <li>31 organizations sponsored the event</li> <li>12 keynotes and 52 community presentations were given</li> </ul> &nbsp; Videos and slides from all presentations are now available on the <a href="http://spark-summit.org/2014/agenda">Summit 2014 agenda</a> page. Some highlights include: <ul> <li>Spark Summit <a href="https://www.youtube.com/watch?v=lO7LhVZrNwA&amp;index=2&amp;list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr">keynote from Databricks CEO Ion Stoica</a> introducing <a href="http://www.databricks.com/cloud">Databricks Cloud</a></li> <li>Open source comm...
["Oscar Mendez (CEO of Stratio)"]["Company Blog","Partners"]{"createdOn":"2014-08-08","publishedOn":"2014-08-08","tz":"UTC"}<div class="post-meta">This is a guest post from our friends at <a href="http://www.stratio.com" target="_blank">Stratio</a> announcing that their platform is now a "Certified Apache Spark Distribution".</div> <hr /> <h2>Certified distribution</h2> Stratio is delighted to announce that it is officially a Certified Apache Spark Distribution. The certification is very important for us because we deeply believe that the certification program provides many benefits to the Spark community: It facilitates collaboration and integration, offers broad evolution and support for the rich Spark ecosystem, simplifies adoption of critical security updates and allows development of applications valid for any certified distribution - a key ingredient for a successful ecosystem. <!--more--> This post is a brief history of how we started with big data technologies until we made the shift to Spark. <h2>When Stratio met Spark: A true love story</h2> We started using Big Data technologies more than 7 yea...
["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2014-08-15","publishedOn":"2014-08-15","tz":"UTC"}<div class="post-meta">This is a guest blog post from our friends at Alibaba Taobao.</div> <hr /> Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this blog post, we share our experience with Spark and GraphX from prototype to production at the Alibaba Taobao Data Mining Team. <!--more--> Every day, hundreds of millions of users and merchants interact on Alibaba Taobao’s marketplace. These interactions can be expressed as complicated, large scale graphs. Mining data requires a distributed data processing engine that can support fast interactive queries as well as sophisticated algorithms. Spark and GraphX embed a standard set of graph mining algorithms, including ...
["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2014-08-27","publishedOn":"2014-08-27","tz":"UTC"}One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate various components of a data pipeline. <!--more-->We’re pleased to announce Apache Spark 1.1. ships with built-in support for several statistical algorithms common in exploratory data pipelines: <ul> <li><strong>correlations</strong>: data dependence analysis</li> <li><strong>hypothesis testing</strong>: goodness of fit; independence test</li> <li><strong>stratified sampling</strong>: scaling training set with controlled label distribution</li> <li><strong>random data generation</strong>: randomized algorithms; performance t...
["Patrick Wendell"]["Apache Spark","Engineering Blog","Streaming"]{"createdOn":"2014-09-12","publishedOn":"2014-09-12","tz":"UTC"}Today we’re thrilled to announce the release of Apache Spark 1.1! Apache Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Apache Spark 1.1 and provide context on the priorities of Spark for this and the next release.<!--more--> In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.1 is already available to Databricks customers and has also been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-1-0.html">Apache Spark website</a>. <!--more--> <h2>Maturity of SparkSQL</h2> The 1.1 released upgrades Spark SQL significantly from the preview delivered in Apache Spark 1.0. At Databricks, we’ve migrated all of our customer workloads from Shark to Spark SQL, with between 2X and 5X <a href="https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html">perfo...
["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"]["Apache Spark","Engineering Blog","Streaming"]{"createdOn":"2014-09-16","publishedOn":"2014-09-16","tz":"UTC"}With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components - Spark Streaming - and highlight who is using Spark Streaming and why. Apache Spark 1.1. adds several new features to Spark Streaming.  In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, as well as to provide high availability for Apache Flume sources.  Moreover, Apache Spark 1.1 adds the first of a set of online machine learning algorithms with the introduction of a streaming linear regression. Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time.  Spark Streaming enables this category of high-value use cases, providing a system for processing fast and large streams of data in real time. <b>What is it?</b> Spark Streaming is an extension of the core S...
*(displayed output truncated)* Each row of `databricksBlogDF` renders the columns in schema order: the `authors` array, the `categories` array, the `dates` struct (`createdOn`, `publishedOn`, `tz`), then the HTML `content` of the post and the remaining string fields. For example, the first row begins:

`["Burak Yavuz","Xiangrui Meng"] ["Apache Spark","Engineering Blog","Machine Learning"] {"createdOn":"2014-09-22","publishedOn":"2014-09-22","tz":"UTC"} With an ever-growing community, Apache Spark has had its 1.1 release. ...`
["Xiangrui Meng","Patrick Wendell"]["Apache Spark","Ecosystem","Engineering Blog"]{"createdOn":"2014-12-22","publishedOn":"2014-12-22","tz":"UTC"}Today, we are happy to announce <em>Apache Spark Packages</em> (<a title="http://spark-packages.org" href="http://spark-packages.org">http://spark-packages.org</a>), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. <em>Spark Packages</em> makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. <!--more--> <em>Spark Packages</em> will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes <a href="http://spark-packages.org/package/6">scientific computing libraries</a>, a <a href="http://spark-packages.org/package/10">job execution server</a>, a connector for <a href="http://spark-packages.org/package/3">importing Avro data</a>, tool...
["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]["Engineering Blog","Machine Learning"]{"createdOn":"2015-01-07","publishedOn":"2015-01-07","tz":"UTC"}MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib <i>easy</i>. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guide and example code, to ease the learning curve for users coming from different backgrounds. In Apache Spark 1.2, Databricks, jointly with AMPLab, UC Berkeley, continues this effort by introducing a pipeline API to MLlib for easy creation and tuning of practical ML pipelines. A practical ML pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, connecting the dots ...
["Michael Armbrust"]["Apache Spark","Engineering Blog"]{"createdOn":"2015-01-09","publishedOn":"2015-01-09","tz":"UTC"}Since the inception of Spark SQL in Apache Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform.  Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets">JSON</a>.  In Apache Spark 1.2, we've taken the next step to allow Spark to integrate natively with a far larger number of input sources.  These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API. <a href="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram.png"><img class="wp-image-2372 aligncenter" src="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram-1024x526.png" alt="DataSourcesApiDiagram" width="516" height="265" /></a> The Data Sources API provides a pluggable mechanism...
["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"]["Apache Spark","Engineering Blog","Machine Learning"]{"createdOn":"2015-01-21","publishedOn":"2015-01-21","tz":"UTC"}<div class="post-meta">This is a post written together with Manish Amde from <a href="http://www.origamilogic.com/">Origami Logic</a>.</div> <hr /> Apache Spark 1.2 introduces <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forests</a> and <a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">Gradient-Boosted Trees (GBTs)</a> into MLlib. Suitable for both classification and regression, they are among the most successful and widely deployed machine learning methods. Random Forests and GBTs are <i>ensemble learning algorithms</i>, which combine multiple decision trees to produce even more powerful models. In this post, we describe these models and the distributed implementation in MLlib. We also present simple examples and provide pointers on how to get started. <h2>Ensemble Methods</h2> Simply put, <a href="http://en.wikipedia.org/wiki/Ensemble_learning">ensemble learning algorithms</a> build upon other machine learning methods by combining models...
["Tathagata Das"]["Apache Spark","Engineering Blog","Streaming"]{"createdOn":"2015-01-15","publishedOn":"2015-01-15","tz":"UTC"}Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Apache Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Apache Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications. <h2>Background</h2> Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes ...
["Kavitha Mariappan"]["Announcements","Company Blog"]{"createdOn":"2015-01-27","publishedOn":"2015-01-27","tz":"UTC"}In partnership with <a href="https://typesafe.com/">Typesafe</a>, we are excited to see the publication of the <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=PR&amp;lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey report</a> representing the largest poll of Apache Spark developers to date. Spark is currently the most active open source project in big data and has been rapidly gaining traction over the past few years. This survey of over 2100 respondents further validates the wide variety of use cases and environments where it is being deployed. The survey results indicate that 13% are already using Spark in production environments with 20% of the respondents with plans to deploy Spark in production environments in 2015, and 31% are currently in the process of evaluating it. In total, the survey covers over 500 enterprises that are using or planning to use Spark in production environments ranging from on-premise Hadoop clusters to public clouds, wi...
["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]["Announcements","Company Blog"]{"createdOn":"2015-02-09","publishedOn":"2015-02-09","tz":"UTC"}<a href="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover.jpg"><img class="size-medium wp-image-2486 aligncenter" src="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover-228x300.jpg" alt="large oreilly book cover" width="228" height="300" /></a> Today we are happy to announce that the complete <a href="http://shop.oreilly.com/product/0636920028512.do" target="_blank"><i>Learning Spark</i></a> book is available from O’Reilly in e-book form with the print copy expected to be available February 16th. At Databricks, as the creators behind Apache Spark, we have witnessed <a title="Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!" href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html" target="_blank">explosive growth in the interest and adoption ...
null["Announcements","Company Blog","Customers"]{"createdOn":"2015-02-13","publishedOn":"2015-02-13","tz":"UTC"}We're really excited to share that <a href="http://www.automatic.com">Automatic Labs </a>has selected Databricks as its preferred big data processing platform. Press release: <a href="http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm" target="_blank">http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm</a> Automatic Labs needed to run large and complex queries against their entire data set to explore and come up with new product ideas. Their prior solution using Postgres impeded the ability of Automatic’s team to efficiently explore data because queries took days to run and data could not be easily visualized, preventing Automatic Labs from bringing critical new products to market. They then deployed Databricks, our simple yet powerful unified big data processing platform on Amazon Web Services (AWS) and realized these key bene...
null["Apache Spark","Engineering Blog"]{"createdOn":"2015-02-14","publishedOn":"2015-02-14","tz":"UTC"}2014 has been a year of <a href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html">tremendous growth</a> for Apache Spark.  It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors - including all of the major Hadoop distributors.  Through our ecosystem of products, partners, and training at Databricks, we also saw over 200 enterprises deploying Spark in production. To help Spark achieve this growth, Databricks has worked broadly throughout the project to improve functionality and ease of use. Indeed, while the community has grown a lot, about 75% of the code added to Spark last year came from Databricks. In this post, we would like to highlight some of the additions we made to Spark in 2014, and provide a preview of our priorities in 2015. In general, our approach to developing Spar...
null["Company Blog","Partners"]{"createdOn":"2015-02-19","publishedOn":"2015-02-19","tz":"UTC"}This is a guest blog from our one of our partners: <a href="http://www.memsql.com/" target="_blank">MemSQL</a> <hr /> &nbsp; <h2>Summary</h2> Coupling operational data with the most advanced analytics puts data-driven business ahead. The MemSQL Apache Spark Connector enables such configurations. <h2>Meeting Transactional and Analytical Needs</h2> Transactional databases form the core of modern business operations. Whether that transaction is financial, physical in terms of inventory changes, or experiential in terms of a customer engagement, the transaction itself moves our business forward. But while transactions represent the state of our business, analytics tell us patterns of the past, and help us predict patterns of the future. Analytics can tell us what levers influence profitability and put us ahead of the pack. Success in digital business requires both transactional and analytical prowess, including the foremost means to analyze data. <h2>Speed and Agility with MemSQL and A...
null["Apache Spark","Engineering Blog"]{"createdOn":"2015-02-17","publishedOn":"2015-02-17","tz":"UTC"}Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When we first open sourced Apache Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens. As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind.  This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature: <ul> <li>Ability to scale from kilobytes o...
null["Company Blog","Events"]{"createdOn":"2015-02-24","publishedOn":"2015-02-24","tz":"UTC"}The Strata + Hadoop World Conference in San Jose last week was abuzz with "putting data to work" in keeping with this year's conference theme. This was a significant shift from last year's event where organizations were highly focused on getting their arms around their big data projects and being steeped in evaluating the multitude of tools of new technologies available. Last week's event highlighted what is top of mind for enterprises and developers alike - how to turn their big data initiatives and projects into real business results? One theme was loud and clear - Apache Spark's flame shone bright!  Derrick Harris from GigaOM summed this up aptly in his article "<a href="https://gigaom.com/2015/02/20/for-now-spark-looks-like-the-future-of-big-data/" target="_blank">For now, Spark looks like the future of big data</a>". To quote Derrick, <em>"Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop ...
null["Company Blog","Product"]{"createdOn":"2015-03-04","publishedOn":"2015-03-04","tz":"UTC"}<div class="article-body"> Enterprises have been collecting ever-larger amounts of data with the goal of extracting insights and creating value. Yet despite a few innovative companies who are able to successfully exploit big data, the promised returns of big data remain elusive beyond the grasp of many enterprises. One notable and rapidly growing open source technology that has emerged in the big data space is Apache Spark. Spark is an open source data processing framework that was built for speed, ease of use, and scale. Much of its benefits are due to how it unifies critical data analytics capabilities such as SQL, machine learning and streaming in a single framework. This enables enterprises to simultaneously achieve high performance computing at scale while simplifying their data processing infrastructure by avoiding the difficult integration of many disparate and difficult tools with a single powerful yet simple alternative. While Spark appears to have the potential to solve m...

Showing the first 156 rows.

Nested Data

Think of nested data as columns within columns.

For instance, look at the dates column.
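Because each line of the file is a single compact JSON object, a nested column is simply a nested object. A plain-Python sketch of the idea, using a made-up record in the same shape as the blog data:

```python
import json

# One compact JSON record in the same shape as a row of the blog table
# (hypothetical values, for illustration only).
line = '{"title": "Sample post", "dates": {"createdOn": "2015-02-02", "publishedOn": "2015-02-02", "tz": "UTC"}}'

record = json.loads(line)

# "dates" parses to a dict nested inside the record -- a column within a column.
print(record["dates"]["publishedOn"])  # 2015-02-02
```

Spark infers a struct type for dates for exactly this reason: the field's value is itself an object with named subfields.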

datesDF = databricksBlogDF.select("dates")
display(datesDF)
{"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"}
{"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"}
{"createdOn":"2014-04-01","publishedOn":"2014-04-01","tz":"UTC"}
{"createdOn":"2014-03-27","publishedOn":"2014-03-27","tz":"UTC"}
{"createdOn":"2014-02-04","publishedOn":"2014-02-04","tz":"UTC"}
{"createdOn":"2014-01-02","publishedOn":"2014-01-02","tz":"UTC"}
{"createdOn":"2014-03-26","publishedOn":"2014-03-26","tz":"UTC"}
{"createdOn":"2014-03-21","publishedOn":"2014-03-21","tz":"UTC"}
{"createdOn":"2014-03-19","publishedOn":"2014-03-19","tz":"UTC"}
{"createdOn":"2014-03-03","publishedOn":"2014-03-03","tz":"UTC"}
... (output truncated; 100 rows in total)

Pull out a specific subfield with dot (.) notation.
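Spark resolves the dotted path down through the struct for each row. As a rough plain-Python analogy (get_path is a hypothetical helper for intuition, not part of any Spark API), the same lookup on a nested dict might look like:

```python
def get_path(record, path):
    """Resolve a dotted path such as 'dates.publishedOn' in nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value

# A record shaped like one row of the blog table (illustrative values).
post = {"dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"}}

print(get_path(post, "dates.publishedOn"))  # 2014-04-10
```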

display(databricksBlogDF.select("dates.createdOn", "dates.publishedOn"))
createdOn   publishedOn
2014-04-10  2014-04-10
2014-04-10  2014-04-10
2014-04-01  2014-04-01
2014-03-27  2014-03-27
2014-02-04  2014-02-04
2014-01-02  2014-01-02
2014-03-26  2014-03-26
2014-03-21  2014-03-21
2014-03-19  2014-03-19
2014-03-03  2014-03-03
... (output truncated; 100 rows in total)

Create a DataFrame, databricksBlog2DF, that contains all the original columns plus a new publishedOn column obtained by flattening the dates column.

from pyspark.sql.functions import col
databricksBlog2DF = databricksBlogDF.withColumn("publishedOn",col("dates.publishedOn"))
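Note that withColumn returns a new DataFrame; the original is unchanged. A dict-based sketch of the same idea (flatten_published_on is a hypothetical helper, not a Spark API):

```python
def flatten_published_on(row):
    """Return a new row dict with dates.publishedOn copied to a top-level key.

    Like withColumn, this leaves the original row (including its nested
    dates struct) untouched and produces a new record.
    """
    return {**row, "publishedOn": row["dates"]["publishedOn"]}

row = {"title": "Sample post",
       "dates": {"createdOn": "2015-03-04", "publishedOn": "2015-03-04", "tz": "UTC"}}

flattened = flatten_published_on(row)
print(flattened["publishedOn"])  # 2015-03-04
print("publishedOn" in row)      # False -- the original row is untouched
```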

Apply the printSchema method to the new DataFrame to confirm that the flattened publishedOn column was added.

databricksBlog2DF.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- content: string (nullable = true)
 |-- creator: string (nullable = true)
 |-- dates: struct (nullable = true)
 |    |-- createdOn: string (nullable = true)
 |    |-- publishedOn: string (nullable = true)
 |    |-- tz: string (nullable = true)
 |-- description: string (nullable = true)
 |-- id: long (nullable = true)
 |-- link: string (nullable = true)
 |-- slug: string (nullable = true)
 |-- status: string (nullable = true)
 |-- title: string (nullable = true)
 |-- publishedOn: string (nullable = true)

Both createdOn and publishedOn are stored as strings.

Cast those values to SQL timestamps:

In this case, use a single select method to:

  1. Cast dates.publishedOn to a timestamp data type
  2. "Flatten" the dates.publishedOn column to just publishedOn
from pyspark.sql.functions import to_timestamp
display(databricksBlogDF.select("title",to_timestamp("dates.publishedOn","yyyy-MM-dd").alias("publishedOn")))
title   publishedOn
MapR Integrates the Complete Apache Spark Stack   2014-04-10T00:00:00.000+0000
Apache Spark 0.9.1 Released   2014-04-10T00:00:00.000+0000
Application Spotlight: Alpine Data Labs   2014-04-01T00:00:00.000+0000
Spark SQL: Manipulating Structured Data Using Apache Spark   2014-03-27T00:00:00.000+0000
Apache Spark 0.9.0 Released   2014-02-04T00:00:00.000+0000
Apache Spark In MapReduce (SIMR)   2014-01-02T00:00:00.000+0000
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time   2014-03-26T00:00:00.000+0000
Apache Spark: A Delight for Developers   2014-03-21T00:00:00.000+0000
Databricks announces "Certified on Apache Spark" Program   2014-03-19T00:00:00.000+0000
Apache Spark Now a Top-level Apache Project   2014-03-03T00:00:00.000+0000
... (output truncated)
ML Pipelines: A New High-Level API for MLlib2015-01-07T00:00:00.000+0000
Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform2015-01-09T00:00:00.000+0000
Random Forests and Boosting in MLlib2015-01-21T00:00:00.000+0000
Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming2015-01-15T00:00:00.000+0000
Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!2015-01-27T00:00:00.000+0000
"Learning Spark" book available from O'Reilly2015-02-09T00:00:00.000+0000
Automatic Labs Selects Databricks for Primary Real-Time Data Processing2015-02-13T00:00:00.000+0000
Apache Spark: A review of 2014 and looking ahead to 2015 priorities2015-02-14T00:00:00.000+0000
Extending MemSQL Analytics with Apache Spark2015-02-19T00:00:00.000+0000
Introducing DataFrames in Apache Spark for Large Scale Data Science2015-02-17T00:00:00.000+0000
Databricks at Strata San Jose2015-02-24T00:00:00.000+0000
Databricks: From raw data, to insights and data products in an instant!2015-03-04T00:00:00.000+0000

Create another DataFrame, `databricksBlog2DF`, that contains all of the original columns plus a new `publishedOn` column obtained by flattening the nested `dates` struct.

from pyspark.sql.functions import to_timestamp

databricksBlog2DF = databricksBlogDF.withColumn("publishedOn", to_timestamp("dates.publishedOn", "yyyy-MM-dd"))
display(databricksBlog2DF)
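The flattening step above can be sketched in plain Python on a single compact-JSON record. This is only an illustration of the nested-field access and the date parsing, not Spark itself: the record below is made up for the example, and Spark's `to_timestamp` takes a Java-style pattern (`yyyy-MM-dd`) whose plain-Python analogue is `strptime`'s `%Y-%m-%d`.

```python
import json
from datetime import datetime

# One made-up record shaped like a row of the blog table (illustration only).
line = ('{"title": "Example Post", '
        '"dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"}}')

record = json.loads(line)

# Spark's dot path "dates.publishedOn" corresponds to chained dict lookups:
raw = record["dates"]["publishedOn"]

# to_timestamp(..., "yyyy-MM-dd") parses the string into a timestamp;
# the plain-Python analogue of that pattern is "%Y-%m-%d":
published_on = datetime.strptime(raw, "%Y-%m-%d")

# withColumn adds the flattened value as a new top-level field:
flat = {**record, "publishedOn": published_on}

print(flat["publishedOn"].isoformat())  # -> 2014-04-10T00:00:00
```

In Spark the same dot-path syntax works directly in `select` and `withColumn`, which is why no explicit JSON parsing appears in the cell above.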
(Output truncated. Each row of `databricksBlog2DF` contains the original columns — `authors`, `categories`, `content`, `creator`, `dates`, `description`, `id`, `link`, `slug`, `status`, `title` — plus the new top-level `publishedOn` timestamp, e.g. `2014-04-10T00:00:00.000+0000` for the post "MapR Integrates the Complete Apache Spark Stack".)
["Patrick Wendell"]["Apache Spark","Engineering Blog"]Today, we’re very proud to announce the release of <a title="Spark 1.0.0 Release Notes" href="http://spark.apache.org/releases/spark-release-1-0-0.html">Apache Spark 1.0</a>. Apache Spark 1.0 is a major milestone for the Spark project that brings both numerous new features and strong API compatibility guarantees. The release is also a huge milestone for the Spark developer community: with more than 110 contributors over the past 4 months, it is Spark’s largest release yet, continuing a trend that has quickly made Spark the most active project in the Hadoop ecosystem. <h2>New Features</h2> What features are we most excited about in Apache Spark 1.0? While there are dozens of new features in the release, we’d like to highlight three. <b>Spark SQL</b> The biggest single addition to Apache Spark 1.0 is Spark SQL, a new module that <a title="Spark SQL" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">we’ve previously blogged about</a>...patrick{"createdOn":"2014-05-30","publishedOn":"2014-05-30","tz":"UTC"}null502https://databricks.com/blog/2014/05/30/announcing-spark-1-0.htmlannouncing-spark-1-0publishAnnouncing Apache Spark 1.02014-05-30T00:00:00.000+0000
["Michael Armbrust","Zongheng Yang"]["Apache Spark","Engineering Blog"]With <a title="Announcing Spark 1.0" href="https://databricks.com/blog/2014/05/30/announcing-spark-1-0.html">Apache Spark 1.0</a> out the door, we’d like to give a preview of the next major initiatives in the Spark project. Today, the most active component of Spark is <a title="Spark SQL: Manipulating Structured Data Using Spark" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">Spark SQL</a> - a tightly integrated relational engine that inter-operates with the core Spark API. Spark SQL was released in Spark 1.0, and will provide a lighter weight, agile execution backend for future versions of Shark. In this post, we’d like to highlight some of the ways in which tight integration into Scala and Spark provide us powerful tools to optimize query execution with Spark SQL. This post outlines one of the most exciting features, dynamic code generation, and explains what type of performance boost this feature can offer using queries from a...michael{"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"}null528https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.htmlexciting-performance-improvements-on-the-horizon-for-spark-sqlpublishExciting Performance Improvements on the Horizon for Spark SQL2014-06-02T00:00:00.000+0000
["Michael Hiskey (VP at MicroStrategy Inc.)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.microstrategy.com" target="_blank">MicroStrategy</a> describing why they're excited to have their platform "Certified on Apache Spark".</div> <hr /> <h2>The Need for Speed</h2> Over the past few years, we have seen Hadoop emerge as an effective foundation for many organizations’ big data management frameworks, but as the volume and varieties of data increase, speed continues to be a challenge. More and more of our customers are embracing Big Data, and the value of their investment is dependent on (and limited by) how quickly they can take data to action. We’ve been listening to our clients to understand how we can innovate to stay ahead of the curve to help solve these challenges. Apache Spark grabbed our attention because it addresses many of the limitations of Hadoop’s traditional functionality. Plus, Spark is simply impossible to ignore. The active, growing community of developers and enterpri...arsalan{"createdOn":"2014-06-04","publishedOn":"2014-06-04","tz":"UTC"}null569https://databricks.com/blog/2014/06/04/microstrategy-certified-on-spark.htmlmicrostrategy-certified-on-sparkpublishMicroStrategy "Certified on Apache Spark"2014-06-04T00:00:00.000+0000
["Christopher Nguyen (CEO &amp; Co-Founder of Adatao)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.arimo.com" target="_blank">Arimo</a> describing why and how they bet on Apache Spark.</div> <hr /> In early 2012, a group of engineers with background in distributed systems and machine learning came together to form Arimo. We saw a major unsolved problem in the nascent Hadoop ecosystem: it was largely a storage play. Data was sitting passively on HDFS, with very little value being extracted. To be sure, there was MapReduce, Hive, Pig, etc., but value is a strong function of (a) speed of computation, (b) sophistication of logic, and (c) ease of use. While Hadoop ecosystem was being developed well at the substrate, there was enormous opportunities above it left uncaptured. <strong>On speed:</strong> we had seen data move at-scale and at enormously faster rates in systems like Dremel and PowerDrill at Google. It enabled interactive behavior simply not available to Hadoop users. Without doubt, we k...arsalan{"createdOn":"2014-06-11","publishedOn":"2014-06-11","tz":"UTC"}null585https://databricks.com/blog/2014/06/11/application-spotlight-arimo.htmlapplication-spotlight-arimopublishApplication Spotlight: Arimo2014-06-11T00:00:00.000+0000
["Databricks Press Office"]["Company Blog","Events"]<ul> <li>Three-Day Event in San Francisco Invites Attendees to Gain Insights from the Leading Organizations in Big Data</li> <li>Keynote Speakers Include Executives from Databricks, Cloudera, MapR, DataStax, Jawbone and More</li> <li>Spark Summit Features Different Tracks for Applications, Development, Data Science and Research</li> </ul> &nbsp; BERKELEY, Calif.--(BUSINESS WIRE)-- Databricks and the sponsors of Spark Summit 2014 today announced the full agenda for the summit, including a host of exciting keynotes and community talks. The event will be held June 30–July 2, 2014, at The Westin St. Francis in San Francisco. Spark Summit 2014 arrives at an exciting time for the Apache Spark platform, which has become the most active open source project in the Hadoop ecosystem with more than 200 contributors in the past year. Now available in all major Hadoop distributions, Spark has fostered a fast-growing community on the strength of its technical capabilities, which make big data...scott{"createdOn":"2014-06-12","publishedOn":"2014-06-12","tz":"UTC"}null609https://databricks.com/blog/2014/06/11/spark-summit-2014-brings-together-apache-spark-community.htmlspark-summit-2014-brings-together-apache-spark-communitypublishSpark Summit 2014 Brings Together Apache Spark Community2014-06-12T00:00:00.000+0000
["Dean Wampler (Typesafe)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.lightbend.com" target="_blank">Lightbend</a> after having their Lightbend Activator Apache Spark templates be "Certified on Apache Spark".</div> <hr /> <h2>Apache Spark and the Lightbend Reactive Platform: A Match Made in Heaven</h2> When I started working with Hadoop several years ago, it was frustrating to find that writing Hadoop jobs was hard to do. If your problem fits a query model, then <a title="Hive" href="http://hive.apache.org" target="_blank">Hive</a> provides a SQL-based scripting tool. For many common dataflow problems, <a href="http://pig.apache.org" target="_blank">Pig</a> provides useful abstractions, but it isn't a full-fledged, "Turing-complete" language. Otherwise, you had to use the low-level <a href="http://wiki.apache.org/hadoop/MapReduce" target="_blank">Hadoop MapReduce</a> API. Some third-party APIs exist that wrap the MapReduce API, such as <a href="http://cascading.org...arsalan{"createdOn":"2014-06-13","publishedOn":"2014-06-13","tz":"UTC"}null628https://databricks.com/blog/2014/06/13/application-spotlight-lightbend.htmlapplication-spotlight-lightbendpublishApplication Spotlight: Lightbend2014-06-13T00:00:00.000+0000
["Hari Kodakalla (EVP at Apervi Inc.)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.apervi.com" target="_blank">Apervi</a> after having their Conflux Director™ application be "Certified on Apache Spark".</div> <hr /> <h2>Big Data on Steroids with Apache Spark</h2> As big data takes center stage in the new data explosion, Hadoop has emerged as one the leading technologies addressing the challenges in the space. As the data processing needs of enterprises are growing newer technologies like Apache Spark have emerged as significant options that consistently offer expanded capabilities for the big data space. As these enterprise needs are met, so is the increased appetite for faster processing, low latency requirements for high velocity data and an iterative demand for processing where leading technologies like Hadoop fall short of expectations or at times seem cumbersome to implement due to its inherent design. Delivering on this growing need of enterprises is where Spark plays a ...arsalan{"createdOn":"2014-06-23","publishedOn":"2014-06-23","tz":"UTC"}null643https://databricks.com/blog/2014/06/23/application-spotlight-apervi.htmlapplication-spotlight-apervipublishApplication Spotlight: Apervi2014-06-23T00:00:00.000+0000
["Bill Kehoe (Big Data Architect at Qlik)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.qlik.com" target="_blank">Qlik</a> describing how Apache Spark enables the full power of QlikView, recently Certified on Apache Spark, and its Associative Experience feature over the entire HDFS data set.</div> <hr /> <h2>The Power of Qlik</h2> Qlik provides software and services that help make understanding data a natural part of how people make decisions. Our product, QlikView, is the leading Business Discovery platform that incorporates a unique, associative experience that empowers business users to follow their own path to formulate and answer questions that lead to better decisions. Traditional, query-based BI tools force users thru pre-defined navigation paths which limit the kinds of questions that can be answered and require costly and time consuming revisions to address evolving business needs. In contrast, when a user selects data items using QlikView, all the fields and charts are imm...arsalan{"createdOn":"2014-06-24","publishedOn":"2014-06-24","tz":"UTC"}null651https://databricks.com/blog/2014/06/24/application-spotlight-qlik.htmlapplication-spotlight-qlikpublishApplication Spotlight: Qlik2014-06-24T00:00:00.000+0000
["Databricks Press Office"]["Announcements","Company Blog"]<em>Certified distributions maintain compatibility with open source Apache Spark distribution and thus support the growing ecosystem of Apache Spark applications</em> <hr /> <strong>BERKELEY, Calif. -- June 26, 2014 --</strong> Databricks, the company founded by the creators of Apache Spark, the next generation Big Data engine, today announced the <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">“Certified Spark Distribution” </a>program for vendors with a commercial Spark distribution. Certification indicates that the vendor’s Spark distribution is compatible with the open source Apache Spark distribution, enabling “Certified on Spark” applications - certified to work with Apache Spark - to run on the vendor’s Spark distribution out-of-the-box. “One of Databricks’ goals is to ensure users have a fantastic experience. Our belief is that having the community work together to maintain compatibility and therefore facilitate a vibrant app...arsalan{"createdOn":"2014-06-26","publishedOn":"2014-06-26","tz":"UTC"}null703https://databricks.com/blog/2014/06/26/databricks-launches-certified-spark-distribution-program.htmldatabricks-launches-certified-spark-distribution-programpublishDatabricks Launches "Certified Apache Spark Distribution" Program2014-06-26T00:00:00.000+0000
["Costin Leau (Engineer at Elasticsearch)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.elasticsearch.com" target="_blank">Elasticsearch</a> announcing Elasticsearch is now "Certified on Apache Spark", the first step in a collaboration to provide tighter integration between Elasticsearch and Spark.</div> <hr /> <h2>Elasticsearch Now “Certified on Spark”</h2> Helping businesses get insights out of their data, fast, is core to the mission of Elasticsearch. Being able to live wherever a business stores their data is obviously critical to that mission, and Hadoop is one of the leaders in providing a way for businesses to store massive amounts of data at scale. Over the course of the past year, we have been working hard to bring the power of our real-time search and analytics engine to the Hadoop ecosystem. Our Hadoop connector, Elasticsearch for Apache Hadoop, is compatible with the top three Hadoop distributions – Cloudera, Hortonworks and MapR – and today has achieved another exciting...arsalan{"createdOn":"2014-06-28","publishedOn":"2014-06-28","tz":"UTC"}null713https://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.htmlapplication-spotlight-elasticsearchpublishApplication Spotlight: Elasticsearch2014-06-28T00:00:00.000+0000
["Jake Cornelius (SVP of Product Management at Pentaho)"]["Company Blog","Partners"][sidenote]This post is guest authored by our friends at <a href="http://www.pentaho.com" target="_blank">Pentaho</a> after having their data integration and analytics platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a>[/sidenote] <hr /> One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in <a href="http://www.pentaho.com/what-is-big-data" target="_blank">Big Data</a> to solve new challenges using the existing skill sets they have in their organizations today. Our Pentaho Labs prototyping and innovation efforts around natively integrating data engineering and analytics with Big Data platforms like <a href="http://www.pentaho.com/what-is-hadoop" target="_blank">Hadoop</a> and <a href="http://www.pentaho.com/storm" target="_blank">Storm</a> have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include <a href="http://www.pent...arsalan{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}null720https://databricks.com/blog/2014/06/30/application-spotlight-pentaho.htmlapplication-spotlight-pentahopublishApplication Spotlight: Pentaho2014-06-30T00:00:00.000+0000
["SriSatish Ambati (CEO of 0xData)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.0xdata.com" target="_blank">0xData</a> discussing the release of Sparkling Water - the integration of their H20 offering with the Apache Spark platform.</div> <hr /> <h3>H20 – The Killer-App on Apache Spark</h3> <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/Spark-+-H20.png" width="472" /> In-memory big data has come of age. The Apache Spark platform, with its elegant API, provides a unified platform for building data pipelines. H2O has focused on scalable machine learning as the API for big data applications. Spark + H2O combines the capabilities of H2O with the Spark platform – converging the aspirations of data science and developer communities. H2O is the Killer-Application for Spark. <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/H20-the-Killer-App.png" width="472" /> <h3>Backdrop<...arsalan{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}null732https://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.htmlsparkling-water-h20-sparkpublishSparkling Water = H20 + Apache Spark2014-06-30T00:00:00.000+0000
["Databricks Press Office"]["Announcements","Company Blog"]<ul> <li>Databricks Cloud Allows Users to Get Value from Apache Spark without the Challenges Normally Associated with Big Data Infrastructure</li> <li>Ease-of-Use of Turnkey Solution Brings the Power of Spark to a Wider Audience and Fuels the Growth of the Spark Ecosystem</li> <li>Funding Led by NEA with Follow-on Investment from Andreessen Horowitz</li> </ul> <strong>Berkeley, Calif. (June 30, 2014)</strong>—Databricks, the company founded by the creators of Apache Spark—the powerful open-source processing engine that provides blazingly fast and sophisticated analytics—announced today the launch of <a title="Databricks Cloud" href="https://databricks.com/cloud">Databricks Cloud</a>, a cloud platform built around Apache Spark. In addition to this launch, the company is announcing the close of $33 million in series B funding led by New Enterprise Associates (NEA) with follow-on investment from Andreessen Horowitz. “Getting the full value out of their Big Data investments is still...arsalan{"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"}null768https://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.htmldatabricks-unveils-spark-based-cloud-platformpublishDatabricks Unveils Apache Spark-Based Cloud Platform; Announces Series B Funding2014-06-30T00:00:00.000+0000
["Arsalan Tavakoli-Shiraji"]["Company Blog","Events"]At Databricks, we’ve been thrilled to see the rapid pace of adoption of Apache Spark, as it has been embraced by an increasing number of enterprise vendors and has grown to be the most active open source project in the Hadoop ecosystem. We also know that a critical piece of enabling enterprises to unlock its potential is a strong ecosystem of applications built on top of or integrated with Spark. We launched the <a href="http://www.databricks.com/certification/">“Certified on Apache Spark”</a> program to support these application developer efforts, and have been blown away at the diverse set of applications being built on top of Spark, and want this great work to be exposed to the broader community. In that light, this year’s Spark Summit will have an “Application Spotlight” segment that will highlight some of the best we’ve seen. Read on for details on how to apply and what selection entails. All applications eligible (even if not yet certified) for the Databricks “Certified on Spar...arsalan{"createdOn":"2014-04-29","publishedOn":"2014-04-29","tz":"UTC"}null2462https://databricks.com/blog/2014/04/28/databricks-application-spotlight-at-spark-summit-2014.htmldatabricks-application-spotlight-at-spark-summit-2014publishDatabricks Application Spotlight at Spark Summit 20142014-04-29T00:00:00.000+0000
["Arsalan Tavakoli-Shiraji"]["Company Blog","Partners"]<p>Today, Datastax and Databricks announced a partnership in which Apache Spark becomes an integral part of the Datastax offering, tightly integrated with Cassandra. We’re very excited to be embarking on this journey with Datastax for a multitude of reasons:</p> <h2 id="integrating-operational-systems-with-analytics">Integrating operational systems with analytics</h2> <p>One of the use cases that we’ve increasingly been asked about by Spark users is the ability to create a closed loop system: perform advanced analytics directly on operational data that is then fed back into the operational system to drive necessary adaptation. The tight integration of Cassandra and Spark will enable users to achieve this goal by leveraging Cassandra as the high-performance transactional database that powers online applications and Spark as a next generation processing engine that can deliver deeper insights, faster while seamlessly moving between the two.</p> <h2 id="spark-beyond-hadoop">Spark beyond...arsalan{"createdOn":"2014-05-08","publishedOn":"2014-05-08","tz":"UTC"}null2463https://databricks.com/blog/2014/05/08/databricks-and-datastax.htmldatabricks-and-datastaxpublishDatabricks and Datastax2014-05-08T00:00:00.000+0000
["Databricks Press Office"]["Announcements","Company Blog"] <p><strong>VANCOUVER, BC. – April 30, 2014 –</strong> Simba Technologies Inc., the industry’s expert for Big Data connectivity, announced today that Databricks has licensed Simba’s ODBC Driver as its standards-based connectivity solution for Shark, the SQL front-end for Apache Spark, the next generation Big Data processing engine. Founded by the creators of Apache Spark and Shark, Databricks is developing cutting-edge systems to enable enterprises to discover deeper insights, faster.</p> <p>“We believe that Big Data is a tremendous opportunity that is still largely untapped, and we are working to revolutionize what organizations can do with it,” says Ion Stoica, Chief Executive Officer at Databricks, and Professor of Computer Science at UC Berkeley. “As part of this mission, we understand that BI tools will continue to be a key medium for consuming data and analytics and are excited to announce the availability of an enterprise-grade connectivity option for users of BI tools. ...roy{"createdOn":"2014-04-30","publishedOn":"2014-04-30","tz":"UTC"}null2464https://databricks.com/blog/2014/04/30/databricks-partners-with-simba-to-deliver-shark-odbc-driver.htmldatabricks-partners-with-simba-to-deliver-shark-odbc-driverpublishDatabricks Partners with Simba to Deliver Shark ODBC Driver2014-04-30T00:00:00.000+0000
["Databricks Press Office"]["Announcements","Company Blog","Partners"]<strong>SAN FRANCISCO — July 1, 2014</strong> — Databricks, the company founded by the creators of Apache Spark – the popular open-source processing engine - today announced a new partnership with <a href="http://www.sap.com" target="_blank">SAP (NYSE: SAP)</a> and to deliver a Databricks-certified Apache Spark distribution offering for the SAP HANA® platform. The full production-ready distribution offering, based on Apache Spark 1.0, is deployable in the cloud or on premise and available for immediate download from SAP at no cost at <a href="http://spr.ly/SAP_and_Spark" target="_blank">spr.ly/SAP_and_Spark</a>. The announcement was made at the Spark Summit 2014, being held June 30 – July 2 in San Francisco. The Databricks-certified distribution offering for SAP HANA contains the Spark processing engine that works with any Hadoop distribution out of the box, providing a more complete data store and processing layer for Hadoop. Certified by Databricks to be compatible with the Apache ...arsalan{"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"}null782https://databricks.com/blog/2014/07/01/databricks-announces-partnership-with-sap.htmldatabricks-announces-partnership-with-sappublishDatabricks Announces Partnership with SAP2014-07-01T00:00:00.000+0000
["Arsalan Tavakoli-Shiraji"]["Company Blog","Partners"]This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership announced between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not just because of what it means for Databricks as a company, but just as importantly because of what it means for Apache Spark and the Spark community. <h2>Access to the full corpus of data</h2> Fundamentally, every enterprise's big data vision is to convert data into value; a core ingredient in this quest is the availability of the data that needs to be mined for insights. Although the growth in volume of data sitting in HDFS has been incredible and continues to grow exponentially, much of this has been contextual data - e.g., social data, click-stream data, sensor data, logs, 3rd party data sources - and historical data. Real-time operational data - e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and S...arsalan{"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"}null785https://databricks.com/blog/2014/07/01/integrating-spark-and-hana.htmlintegrating-spark-and-hanapublishIntegrating Apache Spark and HANA2014-07-01T00:00:00.000+0000
["Reynold Xin"]["Apache Spark","Engineering Blog"]With the introduction of Spark SQL and the new Hive on Apache Spark effort (<a href="https://issues.apache.org/jira/browse/HIVE-7292">HIVE-7292</a>), we get asked a lot about our position in these two projects and how they relate to Shark. At the <a href="http://spark-summit.org/2014">Spark Summit</a> today, we announced that we are ending development of Shark and will focus our resources towards Spark SQL, which will provide a superset of Shark’s features for existing Shark users to move forward. In particular, Spark SQL will provide both a seamless upgrade path from Shark 0.9 server and new features such as integration with general Spark programs. <img class="alignnone wp-image-818 size-large" src="https://databricks.com/wp-content/uploads/2014/07/sql-directions-1024x691.png" alt="Future of SQL on Spark" width="400" /> <h2>Shark</h2> When the Shark project started 3 years ago, Hive (on MapReduce) was the only choice for SQL on Hadoop. Hive compiled SQL into scalable MapReduce jobs a...rxin{"createdOn":"2014-07-02","publishedOn":"2014-07-02","tz":"UTC"}null796https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.htmlshark-spark-sql-hive-on-spark-and-the-future-of-sql-on-sparkpublishShark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark2014-07-02T00:00:00.000+0000
["Ion Stoica"]["Company Blog","Product"]Our vision at Databricks is to <strong>make big data easy</strong> so that we enable <strong>every</strong> organization to turn its data into value. At Spark Summit 2014, we were very excited to unveil <a href="https://databricks.com/cloud" target="_blank">Databricks</a>, our first product towards fulfilling this vision. In this post, I’ll briefly go over the challenges that data scientists and data engineers face today when working with big data, and then show how Databricks addresses these challenges. <h2>Today’s Big Data Challenges</h2> While the promise of big data to <a href="http://spark-summit.org/2014/talk/using-spark-to-generate-analytics-for-international-cable-tv-video-distribution" target="_blank">improve businesses</a>, <a href="http://spark-summit.org/2014/talk/david-patterson" target="_blank">save lives</a>, and <a href="http://spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience" target="_blank">advance science</a> is becoming more and more real, analyzi...ion{"createdOn":"2014-07-14","publishedOn":"2014-07-14","tz":"UTC"}null865https://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.htmldatabricks-cloud-making-big-data-easypublishDatabricks: Making Big Data Easy2014-07-14T00:00:00.000+0000
(Display output elided: 28 rows, one per blog post, with columns `authors`, `categories`, `content`, `creator`, `dates`, `description`, `id`, `link`, `slug`, `status`, and `title`. Note the nested values — for example, `authors` is an array such as `["Xiangrui Meng"]`, and `dates` is a struct such as `{"createdOn":"2014-07-16","publishedOn":"2014-07-16","tz":"UTC"}`.)
["Sonal Goyal (CEO)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://nubetech.co/" target="_blank">Nube Technologies</a>, whose Reifier platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Nube Technologies</h2> Nube Technologies builds business applications to better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, promotes enhanced security and risk management and better consolidation and reporting of business data. We help our customers build better and effective models by ensuring that their underlying master data is accurate. <h2>Why Apache Spark</h2> Data matching within a single source or across sources is a very core problem faced by almost every enterprise and we wanted to create a re...john{"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"}null2006https://databricks.com/blog/2014/12/02/application-spotlight-nube-reifier.htmlapplication-spotlight-nube-reifierpublishApplication Spotlight: Nube Reifier2014-12-02T00:00:00.000+0000
[" Dibyendu Bhattacharya (Big Data Architect)"]["Company Blog","Partners"]<div class="post-meta">This is a guest blog post from our friends at Pearson outlining their Apache Spark use case.</div> <hr /> <h2>Introduction of Pearson</h2> Pearson is a British multinational publishing and education company headquartered in London. It is the largest education company and the largest book publisher in the world. Recently, Pearson announced a new organization structure in order to accelerate their push into digital learning, education services and emerging markets. I am part of Pearson Higher Education group, which provides textbooks and digital technologies to teachers and students across Higher Education. Pearson's higher education brands include eCollege, Mastering/MyLabs and Financial Times Publishing. <h2>What we wanted to do</h2> We are building a next generation adaptive learning platform which delivers immersive learning experiences designed for the way today’s students read, think, and learn. This learning platform is a scalable, reliable, cloud-based pl...john{"createdOn":"2014-12-09","publishedOn":"2014-12-09","tz":"UTC"}null2027https://databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.htmlpearson-uses-spark-streaming-for-next-generation-adaptive-learning-platformpublishPearson uses Apache Spark Streaming for next generation adaptive learning platform2014-12-09T00:00:00.000+0000
["Reynold Xin"]["Apache Spark","Engineering Blog"]A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the <a href="http://sortbenchmark.org/">Daytona GraySort contest</a>! In case you missed our <a href="https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html">earlier blog post</a>, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data <strong>3X faster</strong> using <strong>10X fewer machines</strong>. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record. <table class="...rxin{"createdOn":"2014-11-05","publishedOn":"2014-11-05","tz":"UTC"}null2465https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.htmlspark-officially-sets-a-new-record-in-large-scale-sortingpublishApache Spark officially sets a new record in large-scale sorting2014-11-05T00:00:00.000+0000
["Matt MacKinnon (Director of Product Management at Zaloni)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.zaloni.com" target="_blank">Zaloni</a>, whose Bedrock platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>Bedrock’s Managed Data Pipeline now includes Apache Spark</h2> It was evident from the all the buzz at the Strata + Hadoop World conference that Apache Spark has now shifted from the early adopter phase to establishing itself as an integral and permanent part of the Hadoop ecosystem. The rapid pace of adoption is impressive! Given the entrance of Spark into the mainstream Hadoop world, we are glad to announce that Bedrock is now officially Certified on Spark. <h2>How does Spark enhance Bedrock?</h2> Bedrock™ defines a Managed Data Pipeline as consisting of Ingest, Organize, and Prepare stages. Bedrock’s strength lies in the integrated nature of the way data is handled through these stages. ● Ingest: Bring data fr...john{"createdOn":"2014-11-14","publishedOn":"2014-11-14","tz":"UTC"}null2466https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.htmlapplication-spotlight-bedrockpublishApplication Spotlight: Bedrock2014-11-14T00:00:00.000+0000
["John Tripier","Paco Nathan"]["Announcements","Company Blog"]More and more companies are using Apache Spark, and many Spark based pilots are currently deploying in production. In social media, at every big data conference or meetup, people describe new POC, prototypes, and production deployments using Spark. Behind this momentum, a growing need for Spark developers is developing; people who have demonstrated expertise in how to implement best practices for Spark. People who can help the enterprise building increasingly complex and sophisticated solutions on top of their Spark deployments. At Databricks, we get contacted by many enterprises looking for Spark resources to help with their next data-driven initiative. And so beyond our effort to train people on Spark directly or through partners all around the world, we have teamed up with O’Reilly for offering the first industry standard for measuring and validating a developer’s expertise on Spark. <h2>Benefits of being a Spark Certified Developer</h2> The Spark Developer Certification is the wa...john{"createdOn":"2014-11-15","publishedOn":"2014-11-15","tz":"UTC"}null2467https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.htmlthe-spark-certified-developer-programpublishThe Apache Spark Certified Developer Program2014-11-15T00:00:00.000+0000
["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"]["Company Blog","Partners"]<div class="post-meta">This is a guest blog post from our friends at Samsung SDS outlining their Apache Spark use case.</div> <hr /> <h2>Business Challenge</h2> Samsung SDS is the business and IT solutions arm of Samsung Group. A global ICT service provider with over 17,000 employees worldwide and 6.7 billion USD in revenues, Samsung SDS tackles the challenges of some of the largest global enterprises in such industries as manufacturing, financial services, health care and retail. In the different areas Samsung is focused on, the ability to make timely decisions that maximize the value to a business becomes critical. Prescriptive analytics methods have been used effectively to support decision making by leveraging probable future outcomes determined by predictive models and suggesting actions that provide maximal business value. One of the main challenges in applying prescriptive analytics in these areas is the need to analyze a combination of structured and unstructured data at la...john{"createdOn":"2014-11-22","publishedOn":"2014-11-22","tz":"UTC"}null2468https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.htmlsamsung-sds-uses-spark-for-prescriptive-analytics-at-large-scalepublishSamsung SDS uses Apache Spark for prescriptive analytics at large scale2014-11-22T00:00:00.000+0000
["Ameet Talwalkar","Anthony Joseph"]["Announcements","Company Blog"]In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines. Apache Spark offers analysts and engineers a powerful tool for building these pipelines, and learning to build such pipelines will soon be a lot easier. Databricks is excited to be working with professors from University of California Berkeley and University of California Los Angeles to produce two new upcoming Massive Open Online Courses (MOOCs). Both courses will be freely available on the edX MOOC platform in <del>spring</del> summer 2015. edX Verified Certificates are also available for a fee. <img class="aligncenter size-full wp-image-62" style="max-width: 100%; display: block; margin: 30px auto 5px auto;" src="https://databricks.com/wp-content/uploads/2014/12/MOOC1.png" alt="" align="middle" /> The first course, called <a href="https://www.edx.org/course/uc...arsalan{"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"}null2469https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.htmlannouncing-two-spark-based-moocspublishDatabricks to run two massive online courses on Apache Spark2014-12-02T00:00:00.000+0000
["Lieven Gesquiere (Virdata Lead Core R&D)"]["Company Blog","Partners"]<div class="post-meta">This post is guest authored by our friends at <a href="http://www.technicolor.com/" target="_blank">Technicolor</a>, whose Virdata platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Virdata</h2> Virdata is Technicolor’s cloud-native Internet of Things platform offering real-time monitoring, configuration and management of the unprecedented number of connected devices and applications. Combining its highly-scalable data ingestion and messaging capabilities with real-time and historical analytics, Virdata brings value across multiple data-driven markets. The Virdata platform was launched at CES Las Vegas in January, 2014. The Virdata cloud-based platform architecture integrates state-of-the-art open source software components into a homogeneous, high-availability data-processing environment. <h2>Virdata and Apache Spark</h2> The Virdata solution architecture comprises 3 areas:...john{"createdOn":"2014-12-04","publishedOn":"2014-12-04","tz":"UTC"}null2470https://databricks.com/blog/2014/12/03/application-spotlight-technicolor-virdata-internet-of-things-platform.htmlapplication-spotlight-technicolor-virdata-internet-of-things-platformpublishApplication Spotlight: Technicolor Virdata Internet of Things platform2014-12-04T00:00:00.000+0000
["by Databricks Press Office"]["Announcements","Company Blog"]<strong>Highlights:</strong> <ul> <li>Databricks Expands Bay Area Presence, Moves HQ to San Francisco</li> <li>Company Names Kavitha Mariappan as Marketing Vice President</li> </ul> Press Release: <a title="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html" href="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html">http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html</a> <strong>San Francisco, Calif. – January 13, 2015 – </strong><a href="http://www.databricks.com">Databricks</a>, the company founded by the creators of the popular open-source Big Data processing engine Apache Spark with its flagship product, Databricks Cloud, today announced the relocation of their headquarters to San Francisco from Berkeley, California. The expansion is a reflection of Databricks’ growth heading into 2015. The company grew more than 200 percent in headcount over the last year and adds talent to its executive ...kavitha{"createdOn":"2015-01-13","publishedOn":"2015-01-13","tz":"UTC"}null2294https://databricks.com/blog/2015/01/13/databricks-expands-bay-area-presence-moves-hq-to-san-francisco.htmldatabricks-expands-bay-area-presence-moves-hq-to-san-franciscopublishDatabricks Expands Bay Area Presence, Moves HQ to San Francisco2015-01-13T00:00:00.000+0000
["Kavitha Mariappan"]["Announcements","Company Blog"]Complementing our on-going direct and partner-led Apache Spark training efforts, Databricks has teamed up with O’Reilly to offer the industry’s first standard for measuring and validating a developer’s expertise with Spark. Databricks and O’Reilly are proud to announce the online availability of the Spark Certified Developer exams. You can now sign up and take the exam online<a href=" http://go.databricks.com/spark-certified-developer"> here</a>. <b>What is the Spark Certified Developer program?</b> Apache Spark is the most active project in the Big Data ecosystem and is fast becoming the open source alternative of choice for many enterprises. Spark provides enterprises with the scale and sophistication they require to gain insights from their Big Data by providing a unified framework for building data pipelines. Databricks was founded by the team that created and continues to lead both development and training around Spark, and<a href="https://databricks.com/product"> Databricks Cl...kavitha{"createdOn":"2015-01-16","publishedOn":"2015-01-16","tz":"UTC"}null2345https://databricks.com/blog/2015/01/16/spark-certified-developer-exams-available-online.htmlspark-certified-developer-exams-available-onlinepublishApache Spark Certified Developer exams available online!2015-01-16T00:00:00.000+0000
["Kavitha Mariappan"]["Company Blog","Events"]We are thrilled to announce the availability of the <a href="http://go.spark-summit.org/e1t/c/*W6stDzJ6_3DYhW6Y-qp35L8r5j0/*W4PZ7v36VwsQzW58WPXZ57MJJH0/5/f18dQhb0Sq5z8YHrDTW8HLj0x5VQHw7W6bFhBV6P7FhxW4R4BZM57mvC2W1BQYgg4P0TLvW85Q81T83G7d1W9dtj1h7NQNCqW4zWTRG33K-8nW7NMj-x9bTNXYW954KlM4P0Yt6W2d4hSK3bWrh8W2YH1kR47xfHKW2HRyfR6trFPNW47YlYy4bfcHbW47Xx4z3C811XW4-SZvb2KQ2YYW3_VZwP5ThdHgW3s1XjF51G0BJW4Zh8Y-57-WqMW3H_Pty2DzCtRW1zBkSq1sQ3b4W8V-D1g5rcXhJW7JS0c27BQjYmVJB4Mm896Q7XW94B_1g7v78c8W8NqNPC5qWyC0W7JTtyJ2Xm03sW3FBZ5D9lNHw9W6_b40v3vyNkPW6J4Ypk8lBfs0W3bnqM_1C-9rFVL--5_1Pct9JW2mPjk95hqX5PW9lKhck4H6s3gN4m21WR6Q977Vb98_P6s16_2W8Ph58-59BvQ0W7y34GD1FmQY-W7r71Hq2PhWHMW7tprCG95RqNQW2j-Sgt2L5GhqW3G6xft6TMH99W6-cC_w3wXTtZW6Sytzy9fTwQmN3FYx-Q_HpmRf6dY7D511" target="_blank">agenda</a> for Spark Summit East 2015! This inaugural New York City event on <span class="aBn" tabindex="0" data-term="goog_929332804"><span class="aQJ">March 18-19, 2015</span></span> has over thirty jam-packed sessions – offering a ...kavitha{"createdOn":"2015-01-20","publishedOn":"2015-01-20","tz":"UTC"}null2359https://databricks.com/blog/2015/01/20/spark-summit-east-2015-agenda-is-now-available.htmlspark-summit-east-2015-agenda-is-now-availablepublishSpark Summit East 2015 Agenda is Now Available2015-01-20T00:00:00.000+0000
["Yin Huai (Databricks)"]["Apache Spark","Engineering Blog"][sidenote]Note: Starting Spark 1.3, SchemaRDD will be renamed to DataFrame.[/sidenote] <hr /> In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de-facto interchange format for web service API’s as well as long-term storage. With existing tools, users often engineer complex pipelines to read and write JSON data sets within analytical systems. Spark SQL’s JSON support, released in Apache Spark 1.1 and enhanced in Apache Spark 1.2, vastly simplifies the end-to-end-experience of working with JSON data.<!--more--> <h2>Existing practices</h2> In practice, users often face difficulty in manipulating JSON data with modern analytical systems. To write a dataset to JSON format, users first need to write logic to convert their data to JSON. To read and query JSON datasets, a common practice is to us...michael{"createdOn":"2015-02-02","publishedOn":"2015-02-02","tz":"UTC"}null2376https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.htmlan-introduction-to-json-support-in-spark-sqlpublishAn introduction to JSON support in Spark SQL2015-02-02T00:00:00.000+0000
["Jeremy Freeman (Howard Hughes Medical Institute)"]["Apache Spark","Engineering Blog","Streaming"]Many real world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or — in a case we are particularly excited about — the firing of large populations of neurons. In these settings, rather than wait for all the data to be acquired before performing our analyses, we can use streaming algorithms to identify patterns over time, and make more targeted predictions and decisions. One simple strategy is to build machine learning models on static data, and then use the learned model to make predictions on an incoming data stream. But what if the patterns in the data are themselves dynamic? That's where streaming algorithms come in. A key advantage of Apache Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics. This facilitates adding extensions that leverage and combine components in novel ways without reinv...Xiangrui{"createdOn":"2015-01-28","publishedOn":"2015-01-28","tz":"UTC"}null2382https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.htmlintroducing-streaming-k-means-in-spark-1-2publishIntroducing streaming k-means in Apache Spark 1.22015-01-28T00:00:00.000+0000
["Dave Wang (Databricks)"]["Announcements","Company Blog"]Recently <a href="http://www.infoworld.com/article/2871935/application-development/infoworlds-2015-technology-of-the-year-award-winners.html" target="_blank">Infoworld unveiled the 2015 Technology of the Year Award winners</a>, which range from open source software to stellar consumer technologies like the iPhone.  Being the <a title="Announcing Spark 1.2" href="https://databricks.com/blog/2014/12/19/announcing-spark-1-2.html" target="_blank">creators behind Apache Spark</a>, Databricks is thrilled to see Spark in their ranks.  In fact, we built our flagship product, <a title="Databricks Cloud Overview" href="https://databricks.com/product">Databricks</a>, on top of Spark with the ambition to revolutionize big data processing in ways similar to how iPhone revolutionized the mobile experience. The iPhone was revolutionary in a number of ways: first, it integrated a disparate set of consumer electronic capabilities such as mobile phone, camera, GPS, and even laptop; second, it created a...dave_wang{"createdOn":"2015-02-05","publishedOn":"2015-02-05","tz":"UTC"}null2454https://databricks.com/blog/2015/02/05/apache-spark-selected-for-infoworld-2015-technology-of-the-year-award.htmlapache-spark-selected-for-infoworld-2015-technology-of-the-year-awardpublishApache Spark selected for Infoworld 2015 Technology of the Year Award2015-02-05T00:00:00.000+0000
["Patrick Wendell"]["Apache Spark","Engineering Blog"]We at Databricks are thrilled to announce the release of Apache Spark 1.2! Apache Spark 1.2 introduces many new features along with scalability, usability and performance improvements. This post will introduce some key features of Apache Spark 1.2 and provide context on the priorities of Spark for this and the next release. In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.2 has been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-2-0.html">Apache Spark website</a>. Learn more about specific new features in related in-depth posts: <ul> <li><a title="Spark SQL Data Sources API: Unified Data Access for the Spark Platform" href="https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html" target="_blank">Spark SQL data sources API</a></li> <li><a title="An introduction to JSON support in Spark SQL" href="https:/...patrick{"createdOn":"2014-12-19","publishedOn":"2014-12-19","tz":"UTC"}null2471https://databricks.com/blog/2014/12/19/announcing-spark-1-2.htmlannouncing-spark-1-2publishAnnouncing Apache Spark 1.22014-12-19T00:00:00.000+0000
["Xiangrui Meng","Patrick Wendell"]["Apache Spark","Ecosystem","Engineering Blog"]Today, we are happy to announce <em>Apache Spark Packages</em> (<a title="http://spark-packages.org" href="http://spark-packages.org">http://spark-packages.org</a>), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. <em>Spark Packages</em> makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. <!--more--> <em>Spark Packages</em> will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes <a href="http://spark-packages.org/package/6">scientific computing libraries</a>, a <a href="http://spark-packages.org/package/10">job execution server</a>, a connector for <a href="http://spark-packages.org/package/3">importing Avro data</a>, tool...Xiangrui{"createdOn":"2014-12-22","publishedOn":"2014-12-22","tz":"UTC"}null2472https://databricks.com/blog/2014/12/22/announcing-spark-packages.htmlannouncing-spark-packagespublishAnnouncing Apache Spark Packages2014-12-22T00:00:00.000+0000
["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]["Engineering Blog","Machine Learning"]MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib <i>easy</i>. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guide and example code, to ease the learning curve for users coming from different backgrounds. In Apache Spark 1.2, Databricks, jointly with AMPLab, UC Berkeley, continues this effort by introducing a pipeline API to MLlib for easy creation and tuning of practical ML pipelines. A practical ML pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, connecting the dots ...Xiangrui{"createdOn":"2015-01-07","publishedOn":"2015-01-07","tz":"UTC"}null2473https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.htmlml-pipelines-a-new-high-level-api-for-mllibpublishML Pipelines: A New High-Level API for MLlib2015-01-07T00:00:00.000+0000
["Michael Armbrust"]["Apache Spark","Engineering Blog"]Since the inception of Spark SQL in Apache Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform.  Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets">JSON</a>.  In Apache Spark 1.2, we've taken the next step to allow Spark to integrate natively with a far larger number of input sources.  These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API. <a href="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram.png"><img class="wp-image-2372 aligncenter" src="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram-1024x526.png" alt="DataSourcesApiDiagram" width="516" height="265" /></a> The Data Sources API provides a pluggable mechanism...michael{"createdOn":"2015-01-09","publishedOn":"2015-01-09","tz":"UTC"}null2474https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.htmlspark-sql-data-sources-api-unified-data-access-for-the-spark-platformpublishSpark SQL Data Sources API: Unified Data Access for the Apache Spark Platform2015-01-09T00:00:00.000+0000
["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"]["Apache Spark","Engineering Blog","Machine Learning"]<div class="post-meta">This is a post written together with Manish Amde from <a href="http://www.origamilogic.com/">Origami Logic</a>.</div> <hr /> Apache Spark 1.2 introduces <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forests</a> and <a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">Gradient-Boosted Trees (GBTs)</a> into MLlib. Suitable for both classification and regression, they are among the most successful and widely deployed machine learning methods. Random Forests and GBTs are <i>ensemble learning algorithms</i>, which combine multiple decision trees to produce even more powerful models. In this post, we describe these models and the distributed implementation in MLlib. We also present simple examples and provide pointers on how to get started. <h2>Ensemble Methods</h2> Simply put, <a href="http://en.wikipedia.org/wiki/Ensemble_learning">ensemble learning algorithms</a> build upon other machine learning methods by combining models...joseph{"createdOn":"2015-01-21","publishedOn":"2015-01-21","tz":"UTC"}null2475https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.htmlrandom-forests-and-boosting-in-mllibpublishRandom Forests and Boosting in MLlib2015-01-21T00:00:00.000+0000
["Tathagata Das"]["Apache Spark","Engineering Blog","Streaming"]Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Apache Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Apache Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications. <h2>Background</h2> Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes ...tdas{"createdOn":"2015-01-15","publishedOn":"2015-01-15","tz":"UTC"}null2476https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.htmlimproved-driver-fault-tolerance-and-zero-data-loss-in-spark-streamingpublishImproved Fault-tolerance and Zero Data Loss in Apache Spark Streaming2015-01-15T00:00:00.000+0000
["Kavitha Mariappan"]["Announcements","Company Blog"]In partnership with <a href="https://typesafe.com/">Typesafe</a>, we are excited to see the publication of the <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=PR&amp;lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey report</a> representing the largest poll of Apache Spark developers to date. Spark is currently the most active open source project in big data and has been rapidly gaining traction over the past few years. This survey of over 2100 respondents further validates the wide variety of use cases and environments where it is being deployed. The survey results indicate that 13% are already using Spark in production environments with 20% of the respondents with plans to deploy Spark in production environments in 2015, and 31% are currently in the process of evaluating it. In total, the survey covers over 500 enterprises that are using or planning to use Spark in production environments ranging from on-premise Hadoop clusters to public clouds, wi...kavitha{"createdOn":"2015-01-27","publishedOn":"2015-01-27","tz":"UTC"}null2477https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.htmlbig-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-tractionpublishBig data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!2015-01-27T00:00:00.000+0000
["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]["Announcements","Company Blog"]<a href="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover.jpg"><img class="size-medium wp-image-2486 aligncenter" src="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover-228x300.jpg" alt="large oreilly book cover" width="228" height="300" /></a> Today we are happy to announce that the complete <a href="http://shop.oreilly.com/product/0636920028512.do" target="_blank"><i>Learning Spark</i></a> book is available from O’Reilly in e-book form with the print copy expected to be available February 16th. At Databricks, as the creators behind Apache Spark, we have witnessed <a title="Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!" href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html" target="_blank">explosive growth in the interest and adoption ...patrick{"createdOn":"2015-02-09","publishedOn":"2015-02-09","tz":"UTC"}null2479https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.htmllearning-spark-book-available-from-oreillypublish"Learning Spark" book available from O'Reilly2015-02-09T00:00:00.000+0000
null["Announcements","Company Blog","Customers"]We're really excited to share that <a href="http://www.automatic.com">Automatic Labs </a>has selected Databricks as its preferred big data processing platform. Press release: <a href="http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm" target="_blank">http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm</a> Automatic Labs needed to run large and complex queries against their entire data set to explore and come up with new product ideas. Their prior solution using Postgres impeded the ability of Automatic’s team to efficiently explore data because queries took days to run and data could not be easily visualized, preventing Automatic Labs from bringing critical new products to market. They then deployed Databricks, our simple yet powerful unified big data processing platform on Amazon Web Services (AWS) and realized these key bene...kavitha{"createdOn":"2015-02-13","publishedOn":"2015-02-13","tz":"UTC"}null2566https://databricks.com/blog/2015/02/12/automatic-labs-selects-databricks-cloud-for-primary-real-time-data-processing.htmlautomatic-labs-selects-databricks-cloud-for-primary-real-time-data-processingpublishAutomatic Labs Selects Databricks for Primary Real-Time Data Processing2015-02-13T00:00:00.000+0000
null["Apache Spark","Engineering Blog"]2014 has been a year of <a href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html">tremendous growth</a> for Apache Spark.  It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors - including all of the major Hadoop distributors.  Through our ecosystem of products, partners, and training at Databricks, we also saw over 200 enterprises deploying Spark in production. To help Spark achieve this growth, Databricks has worked broadly throughout the project to improve functionality and ease of use. Indeed, while the community has grown a lot, about 75% of the code added to Spark last year came from Databricks. In this post, we would like to highlight some of the additions we made to Spark in 2014, and provide a preview of our priorities in 2015. In general, our approach to developing Spar...patrick{"createdOn":"2015-02-14","publishedOn":"2015-02-14","tz":"UTC"}Spark: A review of 2014 and looking ahead to 2015 priorities2576https://databricks.com/blog/2015/02/13/spark-a-review-of-2014-and-looking-ahead-to-2015-priorities.htmlspark-a-review-of-2014-and-looking-ahead-to-2015-prioritiespublishApache Spark: A review of 2014 and looking ahead to 2015 priorities2015-02-14T00:00:00.000+0000
null["Company Blog","Partners"]This is a guest blog from our one of our partners: <a href="http://www.memsql.com/" target="_blank">MemSQL</a> <hr /> &nbsp; <h2>Summary</h2> Coupling operational data with the most advanced analytics puts data-driven business ahead. The MemSQL Apache Spark Connector enables such configurations. <h2>Meeting Transactional and Analytical Needs</h2> Transactional databases form the core of modern business operations. Whether that transaction is financial, physical in terms of inventory changes, or experiential in terms of a customer engagement, the transaction itself moves our business forward. But while transactions represent the state of our business, analytics tell us patterns of the past, and help us predict patterns of the future. Analytics can tell us what levers influence profitability and put us ahead of the pack. Success in digital business requires both transactional and analytical prowess, including the foremost means to analyze data. <h2>Speed and Agility with MemSQL and A...dave_wang{"createdOn":"2015-02-19","publishedOn":"2015-02-19","tz":"UTC"}null2749https://databricks.com/blog/2015/02/19/extending-memsql-analytics-with-spark.htmlextending-memsql-analytics-with-sparkpublishExtending MemSQL Analytics with Apache Spark2015-02-19T00:00:00.000+0000
null["Apache Spark","Engineering Blog"]Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When we first open sourced Apache Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens. As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind.  This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature: <ul> <li>Ability to scale from kilobytes o...rxin{"createdOn":"2015-02-17","publishedOn":"2015-02-17","tz":"UTC"}null2757https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.htmlintroducing-dataframes-in-spark-for-large-scale-data-sciencepublishIntroducing DataFrames in Apache Spark for Large Scale Data Science2015-02-17T00:00:00.000+0000
null["Company Blog","Events"]The Strata + Hadoop World Conference in San Jose last week was abuzz with "putting data to work" in keeping with this year's conference theme. This was a significant shift from last year's event where organizations were highly focused on getting their arms around their big data projects and being steeped in evaluating the multitude of tools of new technologies available. Last week's event highlighted what is top of mind for enterprises and developers alike - how to turn their big data initiatives and projects into real business results? One theme was loud and clear - Apache Spark's flame shone bright!  Derrick Harris from GigaOM summed this up aptly in his article "<a href="https://gigaom.com/2015/02/20/for-now-spark-looks-like-the-future-of-big-data/" target="_blank">For now, Spark looks like the future of big data</a>". To quote Derrick, <em>"Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop ...dave_wang{"createdOn":"2015-02-24","publishedOn":"2015-02-24","tz":"UTC"}null2830https://databricks.com/blog/2015/02/24/databricks-at-strata-san-jose.htmldatabricks-at-strata-san-josepublishDatabricks at Strata San Jose2015-02-24T00:00:00.000+0000
null["Company Blog","Product"]<div class="article-body"> Enterprises have been collecting ever-larger amounts of data with the goal of extracting insights and creating value. Yet despite a few innovative companies who are able to successfully exploit big data, the promised returns of big data remain elusive beyond the grasp of many enterprises. One notable and rapidly growing open source technology that has emerged in the big data space is Apache Spark. Spark is an open source data processing framework that was built for speed, ease of use, and scale. Much of its benefits are due to how it unifies critical data analytics capabilities such as SQL, machine learning and streaming in a single framework. This enables enterprises to simultaneously achieve high performance computing at scale while simplifying their data processing infrastructure by avoiding the difficult integration of many disparate and difficult tools with a single powerful yet simple alternative. While Spark appears to have the potential to solve m...kavitha{"createdOn":"2015-03-04","publishedOn":"2015-03-04","tz":"UTC"}null2871https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.htmldatabricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instantpublishDatabricks: From raw data, to insights and data products in an instant!2015-03-04T00:00:00.000+0000

Showing the first 152 rows.

With this new DataFrame, invoke the printSchema method to check its schema and confirm the timestamp conversion.

databricksBlog2DF.printSchema()
root |-- authors: array (nullable = true) | |-- element: string (containsNull = true) |-- categories: array (nullable = true) | |-- element: string (containsNull = true) |-- content: string (nullable = true) |-- creator: string (nullable = true) |-- dates: struct (nullable = true) | |-- createdOn: string (nullable = true) | |-- publishedOn: string (nullable = true) | |-- tz: string (nullable = true) |-- description: string (nullable = true) |-- id: long (nullable = true) |-- link: string (nullable = true) |-- slug: string (nullable = true) |-- status: string (nullable = true) |-- title: string (nullable = true) |-- publishedOn: timestamp (nullable = true)
from pyspark.sql.functions import year, col

resultDF = (databricksBlog2DF
  .select("title", col("publishedOn").alias("date"), "link")
  .filter(year(col("publishedOn")) == 2013)
  .orderBy(col("publishedOn"))
)

display(resultDF)
Databricks and the Apache Spark Platform2013-10-27T00:00:00.000+0000https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html
The Growing Apache Spark Community2013-10-28T00:00:00.000+0000https://databricks.com/blog/2013/10/27/the-growing-spark-community.html
Databricks and Cloudera Partner to Support Apache Spark2013-10-29T00:00:00.000+0000https://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications2013-11-22T00:00:00.000+0000https://databricks.com/blog/2013/11/21/putting-spark-to-use.html
Highlights From Spark Summit 20132013-12-19T00:00:00.000+0000https://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html
Apache Spark 0.8.1 Released2013-12-20T00:00:00.000+0000https://databricks.com/blog/2013/12/19/release-0_8_1.html
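Note that display renders the date column as a raw ISO timestamp. If you prefer a human-readable date, Spark's date_format function accepts a pattern such as "MMM dd, yyyy"; the rough Python strftime equivalent of that pattern is "%b %d, %Y". A plain-Python illustration of what that pattern produces (not Spark code):

```python
from datetime import datetime

# Spark's date_format pattern "MMM dd, yyyy" corresponds roughly to
# Python's strftime "%b %d, %Y" (English month abbreviations assumed).
published = datetime(2013, 10, 27)
print(published.strftime("%b %d, %Y"))  # → Oct 27, 2013 (in an English locale)
```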

Array Data

The DataFrame also contains array columns.

Easily determine the number of elements in each array using the built-in size(..) function.

from pyspark.sql.functions import size
display(databricksBlogDF.select(size("authors"),"authors"))
1["Tomer Shiran (VP of Product Management at MapR)"]
1["Tathagata Das"]
1["Steven Hillion"]
2["Michael Armbrust","Reynold Xin"]
1["Patrick Wendell"]
2["Ali Ghodsi","Ahir Reddy"]
2["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"]
2["Jai Ranganathan","Matei Zaharia"]
1["Databricks Press Office"]
1["Ion Stoica"]
2["Ahir Reddy","Reynold Xin"]
1["Pat McDonough"]
1["Ion Stoica"]
1["Patrick Wendell"]
1["Andy Konwinski"]
1["Pat McDonough"]
1["Ion Stoica"]
1["Matei Zaharia"]
2["Ion Stoica","Matei Zaharia"]
1["Arsalan Tavakoli-Shiraji"]
2["Prashant Sharma","Matei Zaharia"]
1["Databricks Training Team"]
1["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"]
1["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"]
1["Patrick Wendell"]
2["Michael Armbrust","Zongheng Yang"]
1["Michael Hiskey (VP at MicroStrategy Inc.)"]
1["Christopher Nguyen (CEO &amp; Co-Founder of Adatao)"]
1["Databricks Press Office"]
1["Dean Wampler (Typesafe)"]
1["Hari Kodakalla (EVP at Apervi Inc.)"]
1["Bill Kehoe (Big Data Architect at Qlik)"]
1["Databricks Press Office"]
1["Costin Leau (Engineer at Elasticsearch)"]
1["Jake Cornelius (SVP of Product Management at Pentaho)"]
1["SriSatish Ambati (CEO of 0xData)"]
1["Databricks Press Office"]
1["Arsalan Tavakoli-Shiraji"]
1["Arsalan Tavakoli-Shiraji"]
1["Databricks Press Office"]
1["Databricks Press Office"]
1["Arsalan Tavakoli-Shiraji"]
1["Reynold Xin"]
1["Ion Stoica"]
1["Xiangrui Meng"]
1["Matei Zaharia"]
3["Burak Yavuz","Xiangrui Meng","Reynold Xin"]
2["Li Pu","Reza Zadeh"]
1["Scott Walent"]
1["Oscar Mendez (CEO of Stratio)"]
2["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"]
4["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]
1["Patrick Wendell"]
3["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"]
2["Burak Yavuz","Xiangrui Meng"]
1["Gavin Targonski (Product Management at Talend)"]
2["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"]
1["Vida Ha"]
2["John Tripier","Paco Nathan"]
1["Christopher Burdorf (Senior Software Engineer at NBC Universal)"]
2["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"]
1["Eric Carr (VP Core Systems Group at Guavus)"]
1["Jeremy Freeman (Freeman Lab)"]
1["Russell Cardullo (Sharethrough)"]
1["Sean Kandel (CTO at Trifacta)"]
1["Reynold Xin"]
1["Reza Zadeh"]
1["Jeff Feng (Product Manager at Tableau Software)"]
1["Scott Walent"]
2["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"]
1["John Kreisa (VP of Strategic Marketing at Hortonworks)"]
1["Sachin Chawla (VP of Engineering)"]
1["Sonal Goyal (CEO)"]
1[" Dibyendu Bhattacharya (Big Data Architect)"]
1["Reynold Xin"]
1["Matt MacKinnon (Director of Product Management at Zaloni)"]
2["John Tripier","Paco Nathan"]
3["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"]
2["Ameet Talwalkar","Anthony Joseph"]
1["Lieven Gesquiere (Virdata Lead Core R&D)"]
1["by Databricks Press Office"]
1["Kavitha Mariappan"]
1["Kavitha Mariappan"]
1["Yin Huai (Databricks)"]
1["Jeremy Freeman (Howard Hughes Medical Institute)"]
1["Dave Wang (Databricks)"]
1["Patrick Wendell"]
2["Xiangrui Meng","Patrick Wendell"]
4["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]
1["Michael Armbrust"]
2["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"]
1["Tathagata Das"]
1["Kavitha Mariappan"]
4["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]
-1null
-1null
-1null
-1null
-1null
-1null
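Note the -1 values above: under the default legacy setting in Spark 2.x (spark.sql.legacy.sizeOfNull), size of a null array returns -1 rather than null. A plain-Python sketch of that behavior (illustrative only, not Spark code):

```python
def spark_size(arr):
    # Mimics Spark's size() under spark.sql.legacy.sizeOfNull=true:
    # a null (None) array yields -1 instead of null.
    return -1 if arr is None else len(arr)

print(spark_size(["Michael Armbrust", "Reynold Xin"]))  # → 2
print(spark_size(None))                                 # → -1
```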

Pull the first element from the authors array using the array subscript operator.

For example, in Scala the 0th element of authors is authors(0), whereas in Python it is authors[0].

from pyspark.sql.functions import col

display(databricksBlogDF.select(col("authors")[0].alias("primaryAuthor")))
Tomer Shiran (VP of Product Management at MapR)
Tathagata Das
Steven Hillion
Michael Armbrust
Patrick Wendell
Ali Ghodsi
Russell Cardullo (Data Infrastructure Engineer at Sharethrough)
Jai Ranganathan
Databricks Press Office
Ion Stoica
Ahir Reddy
Pat McDonough
Ion Stoica
Patrick Wendell
Andy Konwinski
Pat McDonough
Ion Stoica
Matei Zaharia
Ion Stoica
Arsalan Tavakoli-Shiraji
Prashant Sharma
Databricks Training Team
Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)
Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)
Patrick Wendell
Michael Armbrust
Michael Hiskey (VP at MicroStrategy Inc.)
Christopher Nguyen (CEO &amp; Co-Founder of Adatao)
Databricks Press Office
Dean Wampler (Typesafe)
Hari Kodakalla (EVP at Apervi Inc.)
Bill Kehoe (Big Data Architect at Qlik)
Databricks Press Office
Costin Leau (Engineer at Elasticsearch)
Jake Cornelius (SVP of Product Management at Pentaho)
SriSatish Ambati (CEO of 0xData)
Databricks Press Office
Arsalan Tavakoli-Shiraji
Arsalan Tavakoli-Shiraji
Databricks Press Office
Databricks Press Office
Arsalan Tavakoli-Shiraji
Reynold Xin
Ion Stoica
Xiangrui Meng
Matei Zaharia
Burak Yavuz
Li Pu
Scott Walent
Oscar Mendez (CEO of Stratio)
Andy Huang (Alibaba Taobao Data Mining Team)
Doris Xin
Patrick Wendell
Arsalan Tavakoli-Shiraji
Burak Yavuz
Gavin Targonski (Product Management at Talend)
Nick Pentreath (Graphflow)
Vida Ha
John Tripier
Christopher Burdorf (Senior Software Engineer at NBC Universal)
Manish Amde (Origami Logic)
Eric Carr (VP Core Systems Group at Guavus)
Jeremy Freeman (Freeman Lab)
Russell Cardullo (Sharethrough)
Sean Kandel (CTO at Trifacta)
Reynold Xin
Reza Zadeh
Jeff Feng (Product Manager at Tableau Software)
Scott Walent
Ari Himmel (CEO at Faimdata)
John Kreisa (VP of Strategic Marketing at Hortonworks)
Sachin Chawla (VP of Engineering)
Sonal Goyal (CEO)
Dibyendu Bhattacharya (Big Data Architect)
Reynold Xin
Matt MacKinnon (Director of Product Management at Zaloni)
John Tripier
Luis Quintela (Sr. Manager of Big Data Analytics)
Ameet Talwalkar
Lieven Gesquiere (Virdata Lead Core R&D)
by Databricks Press Office
Kavitha Mariappan
Kavitha Mariappan
Yin Huai (Databricks)
Jeremy Freeman (Howard Hughes Medical Institute)
Dave Wang (Databricks)
Patrick Wendell
Xiangrui Meng
Xiangrui Meng
Michael Armbrust
Joseph K. Bradley (Databricks)
Tathagata Das
Kavitha Mariappan
Holden Karau
null
null
null
null
null
null
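As the trailing null rows show, subscripting a null array yields null rather than raising an error (an out-of-bounds index also yields null). A plain-Python analogue of that behavior (illustrative only; the helper name is made up):

```python
def first_author(authors):
    # Spark's authors[0] returns null when the array is null, and also
    # when the index is out of bounds; model both cases with None here.
    if authors is None or len(authors) == 0:
        return None
    return authors[0]

print(first_author(["Ion Stoica", "Matei Zaharia"]))  # → Ion Stoica
print(first_author(None))                             # → None
```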

Explode

The explode function splits an array column into multiple rows, one per element, copying all the other columns into each new row.
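In plain-Python terms, exploding an array column is a flat-map over the array's elements, with the remaining columns duplicated per element (an illustrative sketch, not Spark code):

```python
rows = [
    {"title": "Apache Spark In MapReduce (SIMR)",
     "authors": ["Ali Ghodsi", "Ahir Reddy"]},
]

# One output row per array element; the other fields are copied into each row.
exploded = [
    {"title": r["title"], "author": a}
    for r in rows
    for a in r["authors"]
]
print(exploded)
```

Note that, as with Spark's explode, rows whose array is null or empty simply disappear from the result (Spark 2.2+ offers explode_outer to keep them).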

For example, split the column authors into the column author, with one author per row.

from pyspark.sql.functions import explode, col

display(databricksBlogDF.select("title", "authors", explode(col("authors")).alias("author"), "link"))
MapR Integrates the Complete Apache Spark Stack["Tomer Shiran (VP of Product Management at MapR)"]Tomer Shiran (VP of Product Management at MapR)https://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html
Apache Spark 0.9.1 Released["Tathagata Das"]Tathagata Dashttps://databricks.com/blog/2014/04/09/spark-0_9_1-released.html
Application Spotlight: Alpine Data Labs["Steven Hillion"]Steven Hillionhttps://databricks.com/blog/2014/03/31/application-spotlight-alpine.html
Spark SQL: Manipulating Structured Data Using Apache Spark["Michael Armbrust","Reynold Xin"]Michael Armbrusthttps://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Spark SQL: Manipulating Structured Data Using Apache Spark["Michael Armbrust","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Apache Spark 0.9.0 Released["Patrick Wendell"]Patrick Wendellhttps://databricks.com/blog/2014/02/03/release-0_9_0.html
Apache Spark In MapReduce (SIMR)["Ali Ghodsi","Ahir Reddy"]Ali Ghodsihttps://databricks.com/blog/2014/01/01/simr.html
Apache Spark In MapReduce (SIMR)["Ali Ghodsi","Ahir Reddy"]Ahir Reddyhttps://databricks.com/blog/2014/01/01/simr.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"]Russell Cardullo (Data Infrastructure Engineer at Sharethrough)https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"]Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
Apache Spark: A Delight for Developers["Jai Ranganathan","Matei Zaharia"]Jai Ranganathanhttps://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
Apache Spark: A Delight for Developers["Jai Ranganathan","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
Databricks announces "Certified on Apache Spark" Program["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/03/18/spark-certification.html
Apache Spark Now a Top-level Apache Project["Ion Stoica"]Ion Stoicahttps://databricks.com/blog/2014/03/02/spark-apache-top-level-project.html
AMPLab updates the Big Data Benchmark["Ahir Reddy","Reynold Xin"]Ahir Reddyhttps://databricks.com/blog/2014/02/12/big-data-benchmark.html
AMPLab updates the Big Data Benchmark["Ahir Reddy","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/02/12/big-data-benchmark.html
Databricks at the O'Reilly Strata Conference 2014["Pat McDonough"]Pat McDonoughhttps://databricks.com/blog/2014/02/10/strata-santa-clara-2014.html
Apache Spark and Hadoop: Working Together["Ion Stoica"]Ion Stoicahttps://databricks.com/blog/2014/01/21/spark-and-hadoop.html
Apache Spark 0.8.1 Released["Patrick Wendell"]Patrick Wendellhttps://databricks.com/blog/2013/12/19/release-0_8_1.html
Highlights From Spark Summit 2013["Andy Konwinski"]Andy Konwinskihttps://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html
Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications["Pat McDonough"]Pat McDonoughhttps://databricks.com/blog/2013/11/21/putting-spark-to-use.html
Databricks and Cloudera Partner to Support Apache Spark["Ion Stoica"]Ion Stoicahttps://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
The Growing Apache Spark Community["Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2013/10/27/the-growing-spark-community.html
Databricks and the Apache Spark Platform["Ion Stoica","Matei Zaharia"]Ion Stoicahttps://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html
Databricks and the Apache Spark Platform["Ion Stoica","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html
Databricks and MapR["Arsalan Tavakoli-Shiraji"]Arsalan Tavakoli-Shirajihttps://databricks.com/blog/2014/04/10/partnership-between-databricks-and-mapr.html
Making Apache Spark Easier to Use in Java with Java 8["Prashant Sharma","Matei Zaharia"]Prashant Sharmahttps://databricks.com/blog/2014/04/14/spark-with-java-8.html
Making Apache Spark Easier to Use in Java with Java 8["Prashant Sharma","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2014/04/14/spark-with-java-8.html
Databricks Announces Apache Spark Training Workshops["Databricks Training Team"]Databricks Training Teamhttps://databricks.com/blog/2014/06/02/databricks-hands-on-technical-workshops.html
Application Spotlight: Atigeo xPatterns["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"]Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)https://databricks.com/blog/2014/05/22/application-spotlight-atigeo-xpatterns.html
Pivotal Hadoop Integrates the Full Apache Spark Stack["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"]Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)https://databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the-full-apache-spark-stack.html
Announcing Apache Spark 1.0["Patrick Wendell"]Patrick Wendellhttps://databricks.com/blog/2014/05/30/announcing-spark-1-0.html
Exciting Performance Improvements on the Horizon for Spark SQL["Michael Armbrust","Zongheng Yang"]Michael Armbrusthttps://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
Exciting Performance Improvements on the Horizon for Spark SQL["Michael Armbrust","Zongheng Yang"]Zongheng Yanghttps://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
MicroStrategy "Certified on Apache Spark"["Michael Hiskey (VP at MicroStrategy Inc.)"]Michael Hiskey (VP at MicroStrategy Inc.)https://databricks.com/blog/2014/06/04/microstrategy-certified-on-spark.html
Application Spotlight: Arimo["Christopher Nguyen (CEO &amp; Co-Founder of Adatao)"]Christopher Nguyen (CEO &amp; Co-Founder of Adatao)https://databricks.com/blog/2014/06/11/application-spotlight-arimo.html
Spark Summit 2014 Brings Together Apache Spark Community["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/06/11/spark-summit-2014-brings-together-apache-spark-community.html
Application Spotlight: Lightbend["Dean Wampler (Typesafe)"]Dean Wampler (Typesafe)https://databricks.com/blog/2014/06/13/application-spotlight-lightbend.html
Application Spotlight: Apervi["Hari Kodakalla (EVP at Apervi Inc.)"]Hari Kodakalla (EVP at Apervi Inc.)https://databricks.com/blog/2014/06/23/application-spotlight-apervi.html
Application Spotlight: Qlik["Bill Kehoe (Big Data Architect at Qlik)"]Bill Kehoe (Big Data Architect at Qlik)https://databricks.com/blog/2014/06/24/application-spotlight-qlik.html
Databricks Launches "Certified Apache Spark Distribution" Program["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/06/26/databricks-launches-certified-spark-distribution-program.html
Application Spotlight: Elasticsearch["Costin Leau (Engineer at Elasticsearch)"]Costin Leau (Engineer at Elasticsearch)https://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.html
Application Spotlight: Pentaho["Jake Cornelius (SVP of Product Management at Pentaho)"]Jake Cornelius (SVP of Product Management at Pentaho)https://databricks.com/blog/2014/06/30/application-spotlight-pentaho.html
Sparkling Water = H20 + Apache Spark["SriSatish Ambati (CEO of 0xData)"]SriSatish Ambati (CEO of 0xData)https://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
Databricks Unveils Apache Spark-Based Cloud Platform; Announces Series B Funding["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.html
Databricks Application Spotlight at Spark Summit 2014["Arsalan Tavakoli-Shiraji"]Arsalan Tavakoli-Shirajihttps://databricks.com/blog/2014/04/28/databricks-application-spotlight-at-spark-summit-2014.html
Databricks and Datastax["Arsalan Tavakoli-Shiraji"]Arsalan Tavakoli-Shirajihttps://databricks.com/blog/2014/05/08/databricks-and-datastax.html
Databricks Partners with Simba to Deliver Shark ODBC Driver["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/04/30/databricks-partners-with-simba-to-deliver-shark-odbc-driver.html
Databricks Announces Partnership with SAP["Databricks Press Office"]Databricks Press Officehttps://databricks.com/blog/2014/07/01/databricks-announces-partnership-with-sap.html
Integrating Apache Spark and HANA["Arsalan Tavakoli-Shiraji"]Arsalan Tavakoli-Shirajihttps://databricks.com/blog/2014/07/01/integrating-spark-and-hana.html
Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark["Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
Databricks: Making Big Data Easy["Ion Stoica"]Ion Stoicahttps://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html
New Features in MLlib in Apache Spark 1.0["Xiangrui Meng"]Xiangrui Menghttps://databricks.com/blog/2014/07/16/new-features-in-mllib-in-spark-1-0.html
The State of Apache Spark in 2014["Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2014/07/18/the-state-of-apache-spark-in-2014.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Burak Yavuzhttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Xiangrui Menghttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Distributing the Singular Value Decomposition with Apache Spark["Li Pu","Reza Zadeh"]Li Puhttps://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
Distributing the Singular Value Decomposition with Apache Spark["Li Pu","Reza Zadeh"]Reza Zadehhttps://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
Spark Summit 2014 Highlights["Scott Walent"]Scott Walenthttps://databricks.com/blog/2014/07/22/spark-summit-2014-highlights.html
When Stratio Met Apache Spark: A True Love Story["Oscar Mendez (CEO of Stratio)"]Oscar Mendez (CEO of Stratio)https://databricks.com/blog/2014/08/08/when-stratio-met-spark-a-true-love-story.html
Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"]Andy Huang (Alibaba Taobao Data Mining Team)https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
(display output truncated: one row per (title, authors, author, link) combination — multi-author posts such as "Statistics Functionality in Apache Spark 1.1" appear once per author)

The output is easier to read if we restrict it to articles with multiple authors, then sort by title.

from pyspark.sql.functions import col, explode, size

databricksBlog2DF = (databricksBlogDF
  .select("title", "authors", explode(col("authors")).alias("author"), "link")  # one row per author
  .filter(size(col("authors")) > 1)  # keep only multi-author posts
  .orderBy("title")
)

display(databricksBlog2DF)
"Learning Spark" book available from O'Reilly["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"Learning Spark" book available from O'Reilly["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]Holden Karauhttps://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"Learning Spark" book available from O'Reilly["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]Andy Konwinskihttps://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
"Learning Spark" book available from O'Reilly["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"]Patrick Wendellhttps://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
AMPLab updates the Big Data Benchmark["Ahir Reddy","Reynold Xin"]Ahir Reddyhttps://databricks.com/blog/2014/02/12/big-data-benchmark.html
AMPLab updates the Big Data Benchmark["Ahir Reddy","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/02/12/big-data-benchmark.html
Announcing Apache Spark Packages["Xiangrui Meng","Patrick Wendell"]Patrick Wendellhttps://databricks.com/blog/2014/12/22/announcing-spark-packages.html
Announcing Apache Spark Packages["Xiangrui Meng","Patrick Wendell"]Xiangrui Menghttps://databricks.com/blog/2014/12/22/announcing-spark-packages.html
Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"]Nick Pentreath (Graphflow)https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html
Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"]Kan Zhang (IBM)https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html
Apache Spark 1.1: MLlib Performance Improvements["Burak Yavuz","Xiangrui Meng"]Burak Yavuzhttps://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html
Apache Spark 1.1: MLlib Performance Improvements["Burak Yavuz","Xiangrui Meng"]Xiangrui Menghttps://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html
Apache Spark 1.1: The State of Spark Streaming["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"]Arsalan Tavakoli-Shirajihttps://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html
Apache Spark 1.1: The State of Spark Streaming["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"]Tathagata Dashttps://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html
Apache Spark 1.1: The State of Spark Streaming["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"]Patrick Wendellhttps://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html
Apache Spark In MapReduce (SIMR)["Ali Ghodsi","Ahir Reddy"]Ali Ghodsihttps://databricks.com/blog/2014/01/01/simr.html
Apache Spark In MapReduce (SIMR)["Ali Ghodsi","Ahir Reddy"]Ahir Reddyhttps://databricks.com/blog/2014/01/01/simr.html
Apache Spark: A Delight for Developers["Jai Ranganathan","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
Apache Spark: A Delight for Developers["Jai Ranganathan","Matei Zaharia"]Jai Ranganathanhttps://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html
Application Spotlight: Faimdata["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"]Ari Himmel (CEO at Faimdata)https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html
Application Spotlight: Faimdata["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"]Nan Zhu (Chief Architect at Faimdata)https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html
Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers["John Tripier","Paco Nathan"]Paco Nathanhttps://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html
Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers["John Tripier","Paco Nathan"]John Tripierhttps://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html
Databricks and the Apache Spark Platform["Ion Stoica","Matei Zaharia"]Ion Stoicahttps://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html
Databricks and the Apache Spark Platform["Ion Stoica","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html
Databricks to run two massive online courses on Apache Spark["Ameet Talwalkar","Anthony Joseph"]Ameet Talwalkarhttps://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html
Databricks to run two massive online courses on Apache Spark["Ameet Talwalkar","Anthony Joseph"]Anthony Josephhttps://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html
Distributing the Singular Value Decomposition with Apache Spark["Li Pu","Reza Zadeh"]Reza Zadehhttps://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
Distributing the Singular Value Decomposition with Apache Spark["Li Pu","Reza Zadeh"]Li Puhttps://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html
Exciting Performance Improvements on the Horizon for Spark SQL["Michael Armbrust","Zongheng Yang"]Michael Armbrusthttps://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
Exciting Performance Improvements on the Horizon for Spark SQL["Michael Armbrust","Zongheng Yang"]Zongheng Yanghttps://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
ML Pipelines: A New High-Level API for MLlib["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]Shivaram Venkataraman (UC Berkeley)https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
ML Pipelines: A New High-Level API for MLlib["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]Evan Sparks (UC Berkeley)https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
ML Pipelines: A New High-Level API for MLlib["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]Xiangrui Menghttps://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
ML Pipelines: A New High-Level API for MLlib["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"]Joseph Bradleyhttps://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
Making Apache Spark Easier to Use in Java with Java 8["Prashant Sharma","Matei Zaharia"]Prashant Sharmahttps://databricks.com/blog/2014/04/14/spark-with-java-8.html
Making Apache Spark Easier to Use in Java with Java 8["Prashant Sharma","Matei Zaharia"]Matei Zahariahttps://databricks.com/blog/2014/04/14/spark-with-java-8.html
Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"]Andy Huang (Alibaba Taobao Data Mining Team)https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"]Wei Wu (Alibaba Taobao Data Mining Team)https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
Random Forests and Boosting in MLlib["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"]Manish Amde (Origami Logic)https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html
Random Forests and Boosting in MLlib["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"]Joseph K. Bradley (Databricks)https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html
Samsung SDS uses Apache Spark for prescriptive analytics at large scale["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"]Luis Quintela (Sr. Manager of Big Data Analytics)https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html
Samsung SDS uses Apache Spark for prescriptive analytics at large scale["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"]Yan Breek (Data Scientist)https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html
Samsung SDS uses Apache Spark for prescriptive analytics at large scale["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"]Girish Kathalagiri (Data Analytics Engineer)https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Xiangrui Menghttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Scalable Collaborative Filtering with Apache Spark MLlib["Burak Yavuz","Xiangrui Meng","Reynold Xin"]Burak Yavuzhttps://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Scalable Decision Trees in MLlib["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"]Joseph Bradley (Databricks)https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html
Scalable Decision Trees in MLlib["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"]Manish Amde (Origami Logic)https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"]Russell Cardullo (Data Infrastructure Engineer at Sharethrough)https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"]Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html
Spark SQL: Manipulating Structured Data Using Apache Spark["Michael Armbrust","Reynold Xin"]Michael Armbrusthttps://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Spark SQL: Manipulating Structured Data Using Apache Spark["Michael Armbrust","Reynold Xin"]Reynold Xinhttps://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Statistics Functionality in Apache Spark 1.1["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]Doris Xinhttps://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html
Statistics Functionality in Apache Spark 1.1["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]Burak Yavuzhttps://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html
Statistics Functionality in Apache Spark 1.1["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]Xiangrui Menghttps://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html
Statistics Functionality in Apache Spark 1.1["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"]Hossein Falakihttps://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html
The Apache Spark Certified Developer Program["John Tripier","Paco Nathan"]John Tripierhttps://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html
The Apache Spark Certified Developer Program["John Tripier","Paco Nathan"]Paco Nathanhttps://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html

Exercise 1

Identify all the articles written or co-written by Michael Armbrust.

# TODO
from pyspark.sql.functions import array_contains
articlesByMichaelDF = # FILL_IN
# TEST - Run this cell to test your solution.

from pyspark.sql import Row

resultsCount = articlesByMichaelDF.count()
dbTest("DF-L5-articlesByMichael-count", 3, resultsCount)  

results = articlesByMichaelDF.collect()

dbTest("DF-L5-articlesByMichael-0", Row(title=u'Spark SQL: Manipulating Structured Data Using Apache Spark'), results[0])
dbTest("DF-L5-articlesByMichael-1", Row(title=u'Exciting Performance Improvements on the Horizon for Spark SQL'), results[1])
dbTest("DF-L5-articlesByMichael-2", Row(title=u'Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform'), results[2])

print("Tests passed!")

Step 2

Show the list of Michael Armbrust's articles in HTML format.
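This copy of the notebook has no starter cell for this step. As a hedged sketch (not the official solution), one way to render collected titles as an HTML list — with the collected rows stubbed out here by a plain Python list, since `articlesByMichaelDF` is built in Step 1 — is:

```python
# Stand-in for articlesByMichaelDF.collect(); in the notebook, call
# articlesByMichaelDF.collect() instead. The row content here is illustrative only.
rows = [{"title": "Spark SQL: Manipulating Structured Data Using Apache Spark"}]

# Build an HTML unordered list, one <li> per article title.
html = "<ul>" + "".join("<li>{}</li>".format(r["title"]) for r in rows) + "</ul>"

print(html)  # in a Databricks notebook, render it with displayHTML(html)
```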

Exercise 2

Identify the complete set of categories used in the Databricks blog articles.

Step 1

Starting with the databricksBlogDF DataFrame, create another DataFrame called uniqueCategoriesDF where:

  1. The data set contains only the one column category (and no others).
  2. The list of categories is unique (each category appears exactly once).
# TODO
uniqueCategoriesDF = # FILL_IN
# TEST - Run this cell to test your solution.

resultsCount =  uniqueCategoriesDF.count()

dbTest("DF-L5-uniqueCategories-count", 12, resultsCount)

results = uniqueCategoriesDF.collect()

dbTest("DF-L5-uniqueCategories-0", Row(category=u'Announcements'), results[0])
dbTest("DF-L5-uniqueCategories-1", Row(category=u'Apache Spark'), results[1])
dbTest("DF-L5-uniqueCategories-2", Row(category=u'Company Blog'), results[2])

dbTest("DF-L5-uniqueCategories-9", Row(category=u'Platform'), results[9])
dbTest("DF-L5-uniqueCategories-10", Row(category=u'Product'), results[10])
dbTest("DF-L5-uniqueCategories-11", Row(category=u'Streaming'), results[11])

print("Tests passed!")

Step 2

Show the complete list of categories.

# TODO

FILL_IN

Exercise 3

Count how many times each category is referenced in the Databricks blog.

# TODO

from pyspark.sql.functions import count
totalArticlesByCategoryDF = # FILL_IN
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.count()

dbTest("DF-L5-articlesByCategory-count", 12, results)

print("Tests passed!")
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.collect()

dbTest("DF-L5-articlesByCategory-0", Row(category=u'Announcements', total=72), results[0])
dbTest("DF-L5-articlesByCategory-1", Row(category=u'Apache Spark', total=132), results[1])
dbTest("DF-L5-articlesByCategory-2", Row(category=u'Company Blog', total=224), results[2])

dbTest("DF-L5-articlesByCategory-9", Row(category=u'Platform', total=4), results[9])
dbTest("DF-L5-articlesByCategory-10", Row(category=u'Product', total=83), results[10])
dbTest("DF-L5-articlesByCategory-11", Row(category=u'Streaming', total=21), results[11])

print("Tests passed!")

Step 2

Display the total for each category in HTML format, ordered by category.

# TODO

FILL_IN

Summary

  • Spark DataFrames allow you to query and manipulate structured and semi-structured data.
  • Spark DataFrames' built-in functions provide powerful primitives for querying complex schemas.

Review Questions

Q: What is the syntax for accessing nested columns?
A: Use the dot notation: select("dates.publishedOn")

Q: What is the syntax for accessing the first element in an array?
A: Use the [subscript] notation: select(col("authors")[0])

Q: What is the syntax for expanding an array into multiple rows?
A: Use the explode function: select(explode(col("authors")).alias("Author"))

Next Steps

Start the next lesson, Querying Data Lakes with DataFrames.

Additional Topics & Resources